Abstract


Quantization is a popular technique to reduce communication in distributed optimization. Motivated
by the classical work on inexact gradient descent (GD) [1], we provide a general convergence analysis
framework for inexact GD that is tailored for quantization schemes. We also propose a quantization scheme,
Double Encoding and Error Diminishing (DEED). DEED can achieve small communication complexity
in three settings: frequent-communication large-memory, frequent-communication small-memory, and
infrequent-communication (e.g. federated learning). More specifically, in the frequent-communication
large-memory setting, DEED can be easily combined with Nesterov’s method, so that the total number
of bits required is $\tilde{O}(\sqrt{\kappa} \log(1/\epsilon))$, where $\tilde{O}$ hides numerical constants and $\log \kappa$ factors. In the
frequent-communication small-memory setting, DEED combined with SGD only requires $\tilde{O}(\kappa \log(1/\epsilon))$
bits in the interpolation regime. In the infrequent-communication setting, DEED combined with Federated
Averaging requires a smaller total number of bits than Federated Averaging. All these algorithms converge
at the same rate as their non-quantized versions, while using a smaller number of bits.


1 Introduction 


There has been a surge of interest in distributed learning for large-scale computation in the recent decade [2-9]. In the past
few years, new application scenarios such as multi-GPU computation [10-14], mobile edge computing
and federated learning have received much attention. These systems are often bandwidth limited, and
one important question in distributed learning is how to reduce the communication complexity.


A natural method to reduce the communication complexity is to compress the gradients transmitted
between the machines. A host of recent works proposed to quantize the gradients and transfer the quantized
gradients to save communication cost [10-14]. These gradient quantization methods have been successfully
applied to large-scale training problems, and are shown to achieve performance similar to the original methods
using less training time [13, 14]. Nevertheless, their theoretical properties, especially the relation to their
un-quantized versions, are not well understood. A recent work proposed DIANA and proved its linear
convergence rate in a large-memory setting. Another work proposed DORE, which converges linearly to
a bounded region in a small-memory setting. Nevertheless, these works often study a specific distributed
optimization problem, and do not directly apply to other settings. For instance, they assume frequent
communication, and thus do not immediately apply to infrequent communication because of the additional error
caused by the infrequent updates. While it may be possible to further adapt these works to new settings, one
might still wonder whether there is a unified and principled method to design quantization methods.


Our work is motivated by the classical work on inexact gradient descent (GD) [1], which provides a unified
theorem for inexact GD and covers a number of algorithms (including SGD). We provide a general analysis
framework for inexact GD that is tailored for quantization schemes. We also propose a quantization scheme,
Double Encoding and Error Diminishing (DEED). We summarize our contributions below.


• General convergence analysis. We provide a general convergence analysis for inexact gradient
descent algorithms using absolute errors in encoding. This can potentially cover a large number of 
quantized gradient-type methods. 


• General quantization scheme. We propose a general quantization scheme, Double Encoding and
Error Diminishing (DEED). This scheme can be easily combined with existing optimization methods, 
and provably saves bits in communication in three common settings of distributed optimization. 


• Improved communication complexity. In the most basic setting of large-memory
frequent-communication, our theoretical bound for DEED-GD (DEED applied to GD) is at least $F$ times better
than existing works, where $F$ is the number of bits representing a real number.


The motivation and details of our general convergence analysis and general quantization scheme will be given
in Section 2 and Section 3. Discussions of related work are in the appendix. We now summarize the results
of our general quantization scheme in three common settings.


Frequent-communication large-memory. We propose an algorithm, DEED-GD, and show that it converges
linearly while saving communication in bits. We further combine DEED-GD with Nesterov’s momentum to
obtain an accelerated version called A-DEED-GD, which achieves the state-of-the-art convergence rate and saves
the largest total number of bits in communication, as shown in Table 1.


Frequent-communication small-memory. We adopt the Weak Growth Condition (WGC) and prove
that our algorithm DEED-SGD converges to the optimal solution at a linear rate. We compute the total
number of bits needed to achieve a certain accuracy for both our algorithm and other works. The comparison is
presented in Table 2.


Infrequent-communication. We propose DEED-Fed and provide the first explicit bound on the number
of bits needed to achieve a certain accuracy in Federated Learning under non-i.i.d. assumptions. Our results apply
to both large-memory and small-memory settings, and to both full-participation and partial-participation
settings. To the best of our knowledge, only one prior work also provides a convergence rate for the
infrequent-communication setting under realistic assumptions. However, due to the limitation of its framework, it
does not quantize in the broadcasting step, which wastes a great deal of communication. Our algorithm
can save up to $FdN$ bits per $E$ iterations.¹


¹ See Table 1 for the definitions of $F$, $d$ and $N$.


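Algorithm | Iterations | Bits per iteration | Total bits
DEED-GD | $\tilde{O}(\kappa \log\frac{1}{\epsilon})$ | $O(dN)$ | $\tilde{O}(dN\kappa \log\frac{1}{\epsilon})$
 | $\tilde{O}(\sqrt{\kappa} \log\frac{1}{\epsilon})$ | $O(dN)$ | $\tilde{O}(dN\sqrt{\kappa} \log\frac{1}{\epsilon})$
 | $\tilde{O}(\kappa \log\frac{1}{\epsilon})$ | $O(dNC)$ | $\tilde{O}(dNC\kappa \log\frac{1}{\epsilon})$
 | $\tilde{O}(\sqrt{\kappa} \log\frac{1}{\epsilon})$ | $O(dNC)$ | $\tilde{O}(dNC\sqrt{\kappa} \log\frac{1}{\epsilon})$
 | $\tilde{O}(\kappa \log\frac{1}{\epsilon})$ | $O(dN)$ | $\tilde{O}(dN\kappa \log\frac{1}{\epsilon})$
 | / | $O(dNC)$ | /
 | $\tilde{O}(\kappa \log\frac{1}{\epsilon})$ | $O(dNC)$ | $\tilde{O}(dNC\kappa \log\frac{1}{\epsilon})$


Table 1: Summary of our theoretical results in minimizing a strongly convex function in the large-memory
setting with $N$ computing nodes. We denote the condition number of the problem by $\kappa$ and the problem
dimension by $d$. $\tilde{O}(\cdot)$ omits $\log \kappa$ and constant terms. Because the algorithms DIANA, ADIANA, DQGD and
QSVRG do not use double quantization, they can either broadcast (fully connected
network) or transmit the full gradient from the center to the workers (star network). $C = N$ in the first case, and
$C = F$ otherwise, where $F$ is the number of bits representing a real number. DQGD cannot converge to the
optimal solution and is not directly comparable to our result, so we use / in the iteration cell.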


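Algorithm | Iterations | Bits per iteration | Total bits
DEED-SGD (WGC) | $\tilde{O}(\kappa \log\frac{1}{\epsilon})$ | $O(dN)$ | $\tilde{O}(dN\kappa \log\frac{1}{\epsilon})$
 | O(S) | $O(dNC)$ | O(dNCS)
 | / | $O(dN)$ | /
 | $\tilde{O}(\kappa \log\frac{1}{\epsilon})$ | $O(dNC)$ | $\tilde{O}(dNC\kappa \log\frac{1}{\epsilon})$


Table 2: Summary of our theoretical results in solving Problem (1) in the small-memory setting. All notation is
the same as in Table 1.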


2 General Analysis Framework 


2.1 Motivation 


Our work is inspired by the classical work by Bertsekas and Tsitsiklis [1], which provides a general convergence
analysis of inexact gradient descent methods. Since quantization also introduces error into the update direction,
a natural idea is to apply the general framework of [1] to design and analyze quantization methods. However,
directly applying [1] may not provide the best result, because in quantization methods we have rather
strong control of the “error” in the algorithm. This situation is different from the worst-case or random error
considered in [1]. Our idea is to develop a modified analysis framework that can accommodate the extra freedom
of controlling the quantization error, so as to obtain stronger results than directly applying [1]. We
hope that such a general framework can help us design and analyze quantization algorithms for different
settings in a unified manner.


A key element in this analysis is to use absolute error in quantization. Many theoretical works on
quantization methods do not explicitly consider absolute error, but focus on relative error² [13, 24, 25]. These
two types of errors are equivalent in one-step quantization and only differ by scaling, but they are not equivalent in a
multi-iteration convergence analysis. In our framework, we use absolute error in quantization so that we can
sum up the absolute error over the iterates and control the rate.


² The definition of quantization by absolute error $\alpha$ is in Definition 2.1. The definition of quantization by relative error $\alpha$ is
similar. We just need to replace $\alpha$ on the right-hand side of the inequality in Definition 2.1 by $\alpha \|w\|$.


Overview of our general analysis framework. We first discuss why we use absolute error in
quantization, and then we provide a general analysis of the convergence rate. Our general analysis is not limited
to algorithms that quantize gradients or gradient differences. Algorithms that quantize
weights, or combinations of weights and gradients, are also covered by our framework. In this general analysis
framework, the key component is the “effective error”, which we define as the error incurred at the weight. For
example, in the frequent-communication large-memory setting, the effective error is the absolute error times the
learning rate. We show that an algorithm converges only if the effective error diminishes to zero. In addition,
we establish the convergence rate in terms of the learning rate and the effective error. Since this is a
general framework, it is applicable to new settings with similar proofs, as we will show in Section 3.4.


2.2 Content of the Framework 


We consider a star network with $N$ computing nodes and one central node. Suppose we want to
minimize a function $f : \mathbb{R}^d \to \mathbb{R}$ decomposed as

$$f(w) = \frac{1}{N} \sum_{i=1}^{N} f_i(w), \qquad (1)$$

where each function $f_i$ is held on the $i$-th machine (or computing node), $i = 1, \dots, N$. We assume each $f_i$
is $L$-smooth and $f$ is $\mu$-strongly convex. We define the condition number $\kappa := \frac{L}{\mu}$. The formal definitions of
$L$-smoothness and strong convexity are standard, so they are given in the appendix.


Definition 2.1. Absolute error encoding-decoding procedure. An $\alpha$-encoding-scheme of a vector $w$
consists of an encoding algorithm $E : S \times \Xi \to \mathbb{Z}^+$ and a decoding algorithm $D : \mathbb{Z}^+ \to \mathbb{R}^d$, where $S$ is the set
of vectors we need to quantize, $\Xi$ is the set of random seeds, $\mathbb{Z}^+$ is the set of positive integers, and $\mathbb{R}$ stands
for the real domain. We assume:


• Unbiased coding, i.e. $\mathbb{E}_\xi[D \circ E(w, \xi)] = w$.


• The absolute error is bounded by $\alpha$, i.e. $\|D \circ E(w, \xi) - w\| \le \alpha$.


Besides, the number of bits of this procedure is $\mathbb{E}_\xi[\log |E(S, \xi)|]$.
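
As a concrete illustration of Definition 2.1, the sketch below implements one simple scheme that is unbiased and has bounded absolute error: coordinate-wise stochastic rounding onto a uniform grid. This is an illustrative assumption and not necessarily the encoder used by DEED. The helper name `stochastic_round` and the grid spacing $\alpha/\sqrt{d}$ are choices made here so that the Euclidean error never exceeds $\alpha$.

```python
import math
import random

def stochastic_round(w, alpha, rng):
    """Quantize w so that E[output] = w and ||output - w||_2 <= alpha.

    Hypothetical helper: every coordinate is rounded to one of its two
    neighbouring grid points (spacing delta = alpha / sqrt(d)), with
    probabilities chosen so that the rounding is unbiased.
    """
    d = len(w)
    delta = alpha / math.sqrt(d)
    out = []
    for x in w:
        lo = math.floor(x / delta)      # index of the grid point below x
        p_up = x / delta - lo           # probability of rounding up
        idx = lo + (1 if rng.random() < p_up else 0)
        out.append(idx * delta)         # decoding: integer index -> real value
    return out

rng = random.Random(0)
w = [0.3, -1.7, 2.25]
alpha = 0.1
samples = [stochastic_round(w, alpha, rng) for _ in range(20000)]
# Empirical mean (should be close to w) and worst observed error (at most alpha).
mean = [sum(s[j] for s in samples) / len(samples) for j in range(len(w))]
worst = max(math.dist(s, w) for s in samples)
```

Only the integer grid indices would need to be transmitted; the receiver recovers the real vector by multiplying with the known spacing.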


The lemma below gives an upper bound and a lower bound on the number of bits for a given precision.


Lemma 2.2. Given a set $S = \{w \in \mathbb{R}^d \mid \|w\|_2 \le M\}$, any (random) quantization algorithm that encodes a
vector in $S$ with absolute error $\sigma$ takes at least $\lceil d \log_2 \frac{1}{\epsilon} \rceil$ bits (in expectation), where $\epsilon = \frac{\sigma}{M}$. In
addition, there exists a (random) algorithm that takes only $\lceil 1.05d + d \log_2 \frac{1+2\epsilon}{\epsilon} \rceil$ bits (in expectation).³
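
To get a feeling for the two bounds in the lemma, the following sketch evaluates the lower bound $\lceil d \log_2 \frac{1}{\epsilon} \rceil$ and the upper bound $\lceil 1.05d + d \log_2 \frac{1+2\epsilon}{\epsilon} \rceil$ with $\epsilon = \sigma/M$. The function name `bit_bounds` and the example values of $d$, $M$ and $\sigma$ are assumptions made for illustration only; the lower bound is clipped at zero, since it becomes vacuous for $\epsilon \ge 1$.

```python
import math

def bit_bounds(d, M, sigma):
    """Return (lower, upper) bounds on the number of bits for dimension d,
    norm bound M and absolute error sigma."""
    eps = sigma / M
    lower = max(0, math.ceil(d * math.log2(1.0 / eps)))
    upper = math.ceil(1.05 * d + d * math.log2((1.0 + 2.0 * eps) / eps))
    return lower, upper

# Example: a 100-dimensional vector of norm at most 1, quantized to error 0.01.
lower, upper = bit_bounds(d=100, M=1.0, sigma=0.01)
```

In this example the two bounds differ by roughly one bit per coordinate, so the constructive scheme is close to optimal when $\epsilon$ is small.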


For convenience, we define $Q(\cdot, \epsilon)$ as a coding procedure with maximal error $\epsilon$, with corresponding
encoding and decoding procedures $E_\epsilon$ and $D_\epsilon$. The output vector $Q(w, \epsilon)$ is $D_\epsilon \circ E_\epsilon(w, \xi)$.


³ This bound is pessimistic when $\epsilon$ is large. For example, when $\epsilon \ge 1$, the lower bound is 0; however, we cannot use 0
bits because of the unbiasedness property. In this case, we can exploit sparsity to use fewer bits.


To derive a general analysis for quantized GD in minimizing Problem (1), we consider a general series
of functions $F_t : \mathbb{R}^d \to \mathbb{R}^d$, $t \ge 0$. The definition of $F_t$ depends on the specific problem. For example, in
the frequent-communication setting, $F_t$ is defined as the function mapping $w_t$ to $w_{t+1}$ in the $t$-th iteration. The only
assumption on $F_t$ in our framework is that each $F_t$ is a continuous function with Lipschitz constant $c_t < 1$ and that
all $F_t$ share the same fixed point $w^*$. In most cases, this can be easily derived from the strong convexity assumption.


Assumption 2.3. For $t = 1, 2, \cdots$, $F_t$ is a continuous function with Lipschitz constant $c_t < 1$, and we denote
by $w^*$ the unique fixed point of all $F_t$.


Theorem 2.4. Suppose Assumption 2.3 holds and $\{w_t\}$ is a sequence generated by

$$w_{t+1} = F_t(w_t) + e_t, \qquad (2)$$

for some chosen initial value $w_0$, where $e_t$ is a zero-mean random noise depending on the (iteration) history
and bounded by $\alpha_t$. Define the series $C_k^2 = \sum_{i=0}^{k-1} \alpha_i^2 \prod_{j=i+1}^{k-1} c_j^2$ and $D_k^2 = \prod_{i=0}^{k-1} c_i^2$. Then we have

$$\mathbb{E}\left[\|w_T - w^*\|^2\right] \le D_T^2 \|w_0 - w^*\|^2 + C_T^2. \qquad (3)$$

In addition, there exist a function series $\{F_t\}_{t \ge 0}$ and noise $\{e_t\}_{t \ge 0}$ for which (3) holds with equality. Besides, if
we suppose the sequence of the Lipschitz constants $\{c_t\}$ is non-decreasing, then the right-hand side of (3)
converges linearly if and only if all $c_t$’s are bounded above by a constant $c < 1$ and $\alpha_T$ converges to 0
linearly.


Remark. We leave the deterministic version of Theorem 2.4 to the appendix. It can be useful in proving
the convergence of deterministic algorithms.


According to Theorem 2.4, to make $\{w_t\}$ converge to $w^*$, we need both $D_T$ and $C_T$ to converge to 0. In
the frequent-communication setting, $D_T \to 0$ implies that the sum of the learning rates diverges. Then $C_T \to 0$
implies that the effective error converges to 0.


The last statement of Theorem 2.4 implies that for any quantized GD algorithm under our framework, we
should take a constant learning rate and a linearly decreasing absolute error to obtain linear convergence.
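
The sketch below illustrates the bound of Theorem 2.4 numerically. It computes $D_T^2$ and $C_T^2$ through the recursions $D_{t+1}^2 = c_t^2 D_t^2$ and $C_{t+1}^2 = c_t^2 C_t^2 + \alpha_t^2$, checks the equality case with $F_t(w) = c_t w$ (so $w^* = 0$) and noise orthogonal to $F_t(w_t)$, and compares a linearly decreasing error with a constant one. The function names and the numerical values ($c_t = 0.9$, $\alpha_t = 0.5 \cdot 0.95^t$) are assumptions chosen for illustration.

```python
import math
import random

def bound_terms(c, alpha):
    """Return (D_T^2, C_T^2) for constants c[0..T-1] and noise bounds alpha[0..T-1]."""
    D2, C2 = 1.0, 0.0
    for c_t, a_t in zip(c, alpha):
        C2 = c_t ** 2 * C2 + a_t ** 2   # C_{t+1}^2 = c_t^2 C_t^2 + alpha_t^2
        D2 *= c_t ** 2                  # D_{t+1}^2 = c_t^2 D_t^2
    return D2, C2

def simulate_worst_case(w0, c, alpha, rng):
    """Iterate w_{t+1} = c_t w_t + e_t in 2D, with e_t orthogonal to c_t w_t,
    of length alpha_t and random sign (zero mean). Returns ||w_T||^2."""
    x, y = w0
    for c_t, a_t in zip(c, alpha):
        fx, fy = c_t * x, c_t * y
        n = math.hypot(fx, fy)
        ox, oy = (-fy / n, fx / n) if n > 0 else (1.0, 0.0)
        sign = 1.0 if rng.random() < 0.5 else -1.0
        x, y = fx + sign * a_t * ox, fy + sign * a_t * oy
    return x * x + y * y

T = 60
c = [0.9] * T
decaying = [0.5 * 0.95 ** t for t in range(T)]   # linearly decreasing absolute error
constant = [0.5] * T                             # constant absolute error

D2, C2_decay = bound_terms(c, decaying)
_, C2_const = bound_terms(c, constant)
sq_dist = simulate_worst_case((3.0, 4.0), c, decaying, random.Random(1))
rhs = D2 * 25.0 + C2_decay                       # ||w_0 - w*||^2 = 25
```

With the constant error, $C_T^2$ settles near $\alpha^2/(1-c^2)$ and the bound cannot go to zero, whereas the decaying error drives $C_T^2$ to zero at a linear rate.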


3 Application of DEED in Three Settings 


Based on Theorem 2.4, we notice that using a diminishing error in each iteration can guarantee fast convergence.
However, according to Lemma 2.2, the maximal norm of the vector we want to quantize should also be
diminishing, otherwise the number of bits may explode. To avoid this explosion, we choose to quantize the gradient
difference instead of the gradient. The intuition is that $\|\nabla f_i(w_{t+1}) - \nabla f_i(w_t)\| \le L \|w_{t+1} - w_t\|$, which goes to
zero as the iterate sequence converges. Finally, to save communication in broadcasting, we perform
quantization both on the computation nodes and on the center node, i.e. “double encoding”. We name our
general quantization scheme Double Encoding and Error Diminishing (DEED).
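
The effect of quantizing differences can be sketched with the upper bound $\lceil 1.05d + d \log_2 \frac{1+2\epsilon}{\epsilon} \rceil$, $\epsilon = \sigma/M$, from Lemma 2.2. In the sketch below the absolute error shrinks as $s c'^{t+1}/2$. For the gradient we assume a fixed norm bound $G$, while for the gradient difference we assume the norm bound shrinks like $G c'^t$. Both norm models and all numerical values are illustrative assumptions, not quantities taken from the analysis.

```python
import math

def upper_bits(d, M, sigma):
    """Upper bound on the number of bits for norm bound M and absolute error sigma."""
    eps = sigma / M
    return math.ceil(1.05 * d + d * math.log2((1.0 + 2.0 * eps) / eps))

d, G, s, c_prime = 100, 1.0, 1.0, 0.9
grad_bits, diff_bits = [], []
for t in (0, 20, 40, 60):
    sigma = s * c_prime ** (t + 1) / 2                         # diminishing absolute error
    grad_bits.append(upper_bits(d, G, sigma))                  # norm bound stays at G
    diff_bits.append(upper_bits(d, G * c_prime ** t, sigma))   # norm bound shrinks with t
```

Under these assumptions the ratio $\sigma/M$ stays constant for the differences, so the bit count per iteration stays constant, while it grows linearly in $t$ when the gradient itself is quantized.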


Based on the general quantization scheme DEED, we introduce algorithms for three common settings in
distributed optimization for Problem (1): frequent-communication large-memory, frequent-communication
small-memory, and infrequent-communication. Frequent communication means that every computing node
communicates with the center node after every update, while this is not the case in infrequent communication.
In the large-memory setting, each local server $i$ has enough memory to hold its data and use it to compute
the full-batch gradient of $f_i$. In the limited-memory setting (e.g. only one GPU is available for computing),
each server is only able to compute stochastic gradients since the data cannot fit into one server.


3.1 Frequent-communication large-memory setting 


We distinguish the large-memory setting and small-memory setting for the following reasons. 


First, from a practical side, different system designers have different memory budgets. Some big companies
can perform computation using 10,000+ GPUs or CPUs (e.g. [26-28]), while most researchers and companies
can only use a few GPUs or a moderate number of CPUs. The problems they are facing are indeed different,
since in the large-memory setting we can implement full-batch GD⁴ (or large-batch SGD, which is quite close
to full-batch GD). Note that “large” is a relative term; if the system designer has only 10 CPUs or even 2
GPUs, but all data can be loaded into the memory of these machines, then this is also a large-memory setting.
In a small-memory setting, we can only load a mini-batch of the dataset into one machine at a time. This
necessitates the use of Stochastic Gradient Descent (SGD).


Second, from a theoretical side, quantized gradient methods should be no better than gradient methods
that utilize infinite bandwidth. To judge the performance of quantized gradient methods, one useful metric is
the gap between quantized methods and their non-quantized counterparts. It is impossible to prove linear
convergence of quantized SGD in the limited-memory setting without further assumptions, since even with
infinite bandwidth SGD cannot achieve a linear convergence rate. In contrast, with infinite bandwidth GD
can achieve a linear convergence rate. Due to these different upper limits, large-memory and small-memory settings
should be treated separately.


The frequent-communication large-memory version of DEED is described in Algorithm 1.


Algorithm 1: Double Encoding and Error Diminishing Gradient Descent (DEED-GD)
Initialization: each server $i \in [N]$ holds $w_0 = s^i_{-1} = v_{-1} = 0$; server 0 holds $s_{-1} = v_{-1} = 0$; $k = 0$;
Hyper-parameters: learning rate $\eta$; $c = 1 - \eta\mu$, $c' \in (c, 1)$; parameter $s \in \mathbb{R}_+$;
while the precision is not enough do
    for $i \in [N]$ do
        server $i$ computes $g^i_k = \nabla f_i(w_k)$;
        server $i$ does quantization $d^i_k = Q(g^i_k - s^i_{k-1}, \frac{s c'^{k+1}}{2})$;
        server $i$ updates $s^i_k = d^i_k + s^i_{k-1}$;
        server $i$ sends $d^i_k$ to server 0;
    end
    server 0 computes $s_k = \frac{1}{N} \sum_{i=1}^{N} d^i_k + s_{k-1}$;
    server 0 does quantization $u_k = Q(s_k - v_{k-1}, \frac{s c'^{k+1}}{2})$;
    server 0 sends $u_k$ to server $i$, $\forall i \in [N]$;
    server 0 updates $v_k = u_k + v_{k-1}$;
    for $i \in [N]$ do
        server $i$ updates $v_k = u_k + v_{k-1}$;
        server $i$ updates $w_{k+1} = w_k - \eta v_k$;
    end
    $k = k + 1$;
end
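
The following single-process sketch simulates the message flow of DEED-GD: workers quantize the difference between their gradient and their running state, the center quantizes the difference between the averaged state and the broadcast state, and all nodes take the step $w_{k+1} = w_k - \eta v_k$. Several ingredients are assumptions made for illustration: the quantizer $Q$ is implemented as unbiased stochastic rounding on a grid, the local objectives are hypothetical diagonal quadratics $f_i(w) = \frac{1}{2} \sum_j A_{ij} (w_j - B_{ij})^2$, and the parameters are set to $\eta = 1/L$, $c' = (1+c)/2$ and $s = 1$.

```python
import math
import random

def quantize(x, err, rng):
    """Unbiased stochastic rounding with ||Q(x) - x||_2 <= err (an assumed choice of Q)."""
    delta = err / math.sqrt(len(x))
    out = []
    for xj in x:
        lo = math.floor(xj / delta)
        out.append((lo + (1 if rng.random() < xj / delta - lo else 0)) * delta)
    return out

def deed_gd(A, B, eta, c_prime, s, iters, rng):
    """Simulate DEED-GD on f_i(w) = 0.5 * sum_j A[i][j] * (w[j] - B[i][j])**2."""
    N, d = len(A), len(A[0])
    w = [0.0] * d
    s_loc = [[0.0] * d for _ in range(N)]   # s^i_{k-1}, state kept by worker i
    s_srv = [0.0] * d                       # s_{k-1}, state kept by server 0
    v = [0.0] * d                           # v_{k-1}, identical on every node
    for k in range(iters):
        err = s * c_prime ** (k + 1) / 2    # diminishing maximal error
        d_sum = [0.0] * d
        for i in range(N):
            g = [A[i][j] * (w[j] - B[i][j]) for j in range(d)]
            d_i = quantize([g[j] - s_loc[i][j] for j in range(d)], err, rng)
            s_loc[i] = [s_loc[i][j] + d_i[j] for j in range(d)]
            d_sum = [d_sum[j] + d_i[j] for j in range(d)]       # worker -> server 0
        s_srv = [d_sum[j] / N + s_srv[j] for j in range(d)]
        u = quantize([s_srv[j] - v[j] for j in range(d)], err, rng)  # second encoding
        v = [v[j] + u[j] for j in range(d)]                     # broadcast and update
        w = [w[j] - eta * v[j] for j in range(d)]
    return w

rng = random.Random(0)
N, d = 4, 5
A = [[rng.uniform(1.0, 2.0) for _ in range(d)] for _ in range(N)]
B = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(N)]
L = max(max(row) for row in A)                                   # smoothness of each f_i
mu = min(sum(A[i][j] for i in range(N)) / N for j in range(d))   # strong convexity of f
eta = 1.0 / L
c = 1.0 - eta * mu
c_prime = (1.0 + c) / 2.0
w_star = [sum(A[i][j] * B[i][j] for i in range(N)) / sum(A[i][j] for i in range(N))
          for j in range(d)]
w = deed_gd(A, B, eta, c_prime, s=1.0, iters=80, rng=rng)
dist = math.dist(w, w_star)
```

Because both quantization errors are at most $s c'^{k+1}/2$, the direction $v_k$ differs from the exact average gradient by at most $s c'^{k+1}$, which is why the iterates still contract towards $w^*$ in this toy run.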


⁴ Disclaimer: we discuss the large-memory setting mainly out of theoretical interest. We do not run simulations on a large number
of GPUs, though we mimic the large-memory setting.


3.2 Frequent-communication small-memory 


Now we consider the small-memory setting with frequent communication. As mentioned earlier, without
extra assumptions, it is impossible to prove a linear convergence rate for vanilla SGD.


There are two lines of research that can prove linear convergence of SGD-type methods. Along the first
line, a few variance-reduction based methods such as SVRG [30], SAGA and SDCA can achieve linear
convergence. Along the second line, with an extra assumption such as the WGC (Weak Growth Condition),
vanilla SGD with constant stepsize can already achieve linear convergence [21]. This line of research is strongly
motivated by the interpolation assumption in machine learning that the learner can fit the data, which is
considered a reasonable assumption in recent literature (e.g. [21, 33]). Therefore, we focus on designing
quantization algorithms along the second line.


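Assumption 3.1. (WGC Assumption) Suppose $f : \mathbb{R}^d \to \mathbb{R}$, $f(w) = \frac{1}{N} \sum_{i=1}^{N} f_i(w)$ is the objective function.
Stochastic functions (algorithms) $\{\tilde{\nabla}_i\}_{i \in [N]}$ satisfy WGC if 1) $\mathbb{E}[\tilde{\nabla}_i(f_i, w)] = \nabla f_i(w)$, $\forall i \in [N]$, $w \in \mathbb{R}^d$; 2)
$\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\tilde{\nabla}_i}\left[\|\tilde{\nabla}_i f_i(w)\|^2\right] \le 2\rho L \left(f(w) - f(w^*)\right)$.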
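
As a sanity check of the second WGC condition in the interpolation regime, the sketch below uses hypothetical least-squares components $f_i(w) = \frac{1}{2}(a_i^\top w - b_i)^2$ with $b_i = a_i^\top w^*$, so that every $f_i$ is minimized at the same $w^*$. For simplicity the stochastic gradient is taken to be the exact $\nabla f_i$, the smoothness constant is $L = \max_i \|a_i\|^2$, and the parameter is set to $\rho = 1$. All of these are assumptions made for illustration. The function returns the right-hand side minus the left-hand side of the inequality.

```python
import random

def wgc_gap(A, w_star, w, rho=1.0):
    """Return 2*rho*L*(f(w) - f(w*)) - (1/N) * sum_i ||grad f_i(w)||^2
    for f_i(w) = 0.5 * (a_i . w - a_i . w_star)^2 (interpolation: f(w*) = 0)."""
    N = len(A)
    L = max(sum(a * a for a in row) for row in A)           # max_i ||a_i||^2
    # Residuals r_i = a_i . (w - w_star); grad f_i(w) = a_i * r_i.
    res = [sum(a * (x - xs) for a, x, xs in zip(row, w, w_star)) for row in A]
    lhs = sum(sum(a * a for a in row) * r * r for row, r in zip(A, res)) / N
    f_gap = sum(0.5 * r * r for r in res) / N
    return 2.0 * rho * L * f_gap - lhs

rng = random.Random(0)
A = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(10)]
w_star = [1.0, -2.0, 0.5]
gaps = [wgc_gap(A, w_star, [rng.gauss(0, 3) for _ in range(3)]) for _ in range(100)]
```

In this setting the gap equals $\frac{1}{N} \sum_i (L - \|a_i\|^2) r_i^2$, which is non-negative at every point, so the condition holds with $\rho = 1$.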


To adapt to the frequent-communication small-memory setting, we introduce DEED-SGD. The only differences
between DEED-GD and DEED-SGD are: 1) we use $\tilde{\nabla}_i$ instead of the exact gradient; 2) we use a different
quantization level. The full description of DEED-SGD is given in the appendix.


3.3 Infrequent-communication 


A main area in distributed optimization with infrequent communication is Federated Learning (FL), which
involves training models over remote devices or data centers, such as mobile phones or hospitals, while keeping
the data localized due to privacy concerns or communication efficiency [17, 34]. In FL, some computation nodes
might not fully participate in the updates, and the data sets are non-i.i.d. We remark that infrequent
communication is a generic design choice, and can be used in a data-center setting as well. Although existing
works like QSGD do not explore this degree of freedom, infrequent communication can be combined with
QSGD as well. Nevertheless, the theoretical benefit of the combination was not understood before (partially
because the total number of bits was not a focus of previous works, and a linear convergence rate was derived
only recently [19]).


A classical algorithm in FL is the Federated Averaging algorithm (FedAvg), which performs $E$ iterations of local
stochastic gradient descent on the computation nodes between model-averaging steps at a server [15].
Although there have been many efforts to develop convergence guarantees for FedAvg [35-41], there are
relatively few theoretical results on the combination of FedAvg and quantization. The works [22, 38] either
make unrealistic assumptions or only perform quantization on the computation nodes, and thus they are not as
efficient as our double encoding scheme.


We propose an algorithm called DEED-Fed. The difference between DEED-GD and DEED-Fed is that in
DEED-Fed, the maximal error at iteration $k$ is proportional to the learning rate $\eta_k$. Due to space limitations, a
detailed comparison between the three proposed algorithms is given in the appendix.


3.4 Theoretical Analysis 


In this section, we give the computational and communication complexity of the algorithms DEED-GD,
DEED-SGD and DEED-Fed. Since all of them fall under the same framework, their proofs and results are similar.
We therefore state them in a single theorem.


Theorem 3.2. Consider solving Problem (1) under one of the three settings by the corresponding algorithm
(DEED-GD, DEED-SGD or DEED-Fed). Assume each $f_i$ is $L$-smooth and $f$ is $\mu$-strongly convex. Assume all $f_i$’s are
$\mu$-strongly convex in DEED-Fed. Denote by $w_t$ the iterate at iteration $t$ and by $w^*$ the optimal solution of Problem (1).


For DEED-GD, we choose the learning rate $\eta_t = \eta = \frac{2}{L+\mu}$, $c := 1 - \eta\mu$, $c < c' < 1$, and the maximal error
at iteration $t$ is $s c'^{t+1}/2$, where $s$ is the quantization level.

For DEED-SGD, we assume the Weak Growth Condition (WGC) is satisfied by the approximate gradient $\tilde{\nabla}_i$
for every $f_i$ with parameter $\rho$. We choose the learning rate $\eta_t = \eta = \frac{1}{\rho L}$, $c := 1 - \eta\mu$, $c < c' < 1$, and the
maximal error at iteration $t$ is $\sqrt{s c'^{t+1}}/2$; the error is unbiased.

For DEED-Fed, we choose the learning rate $\eta_t := \frac{\beta}{t+\gamma}$ for some $\beta > 0$, $\gamma > 1$ such that $\eta_0$ is sufficiently small and
$\eta_t \le 2\eta_{t+E}$. Let the maximal error at iteration $t \in \{0, E, 2E, \cdots\}$ be $s\eta_t$.

We have the following results:


• DEED-GD communicates $O(Nd)$ bits at iteration $t \ge 1$, and
$\|w_t - w^*\| \le (c')^t \left( \max\{0, \|w_0 - w^*\| - A\} + A \right)$,
where $A$ is a constant depending on $\eta$, $s$, $c$ and $c'$.

• DEED-SGD communicates $O(Nd)$ bits at iteration $t \ge 1$, and
$\mathbb{E}\|w_t - w^*\|^2 \le (c')^t \left( \max\{0, \|w_0 - w^*\|^2 - A'\} + A' \right)$,
where $A'$ is a constant depending on $\eta$, $s$, $c$ and $c'$.

• DEED-Fed communicates $\tilde{O}(Nd)$ bits at iteration $t \in \{E, 2E, \cdots\}$, and
$\mathbb{E}\|w_t - w^*\|^2 \le \frac{v}{\gamma + t}$,
where $v$ is some constant dependent on the federated learning setting (e.g. full participation or not) as
well as the initial error $\|w_0 - w^*\|^2$.


Remark 1: Based on these results, we can easily compute the total number of bits needed to achieve a
certain accuracy; see Table 1 and Table 2.


Remark 2: Our result allows a trade-off between communication time and computation time. By changing the
parameters $c'$ and $s$, we can find the best combination of convergence speed and error size.


Theorem 3.2 implies that to achieve $\|w_T - w^*\| \le \epsilon$, we need $\tilde{O}(\kappa \log\frac{1}{\epsilon})$ iterations for DEED-GD and
DEED-SGD, and a number of iterations polynomial in $\frac{1}{\epsilon}$ for DEED-Fed. These convergence rates match those of the
corresponding algorithms with infinite bandwidth. Due to space limitations, we omit the detailed definitions of some
constants in Theorem 3.2 and provide the details in the appendix.


4 Quantization of Nesterov acceleration 


In the frequent-communication large-memory setting, we combine Nesterov’s acceleration with our quantization
scheme DEED. The accelerated algorithm is very similar to Algorithm 1. The only difference between
Algorithm 1 and this accelerated version is that we add momentum in the final update step. The full
description of the algorithm is given in the appendix.

Figure 1: Comparison of our algorithms with other works on a linear regression problem.


Theorem 4.1. Consider solving Problem (1) by Algorithm A-DEED-GD and assume each $f_i$ is $L$-smooth
and $f$ is $\mu$-strongly convex. Let the learning rate be $\eta = \frac{1}{L}$ and the constant $c := \sqrt{1 - \sqrt{\mu/L}}$, with $c < c' < 1$,
and suppose we quantize with maximal error $s c'^{k+1}/2$ at every iteration $k$. Then we have:


(1) $\|w_k - w^*\|$ converges to zero linearly at rate $c'$, with explicit constants that depend on $s$, $c$, $c'$ and
$\gamma := c'/c > 1$.


(2) The number of bits at iteration $k \ge 1$ is $O(Nd)$.


Theorem 4.1 implies that, with the acceleration trick, we can improve the iteration complexity in the
frequent-communication large-memory setting from $\tilde{O}(\kappa)$ to $\tilde{O}(\sqrt{\kappa})$, where $\kappa$ is the condition number of the objective
function. This also reduces the total number of bits in communication.


Remark 1: We separate algorithm A-DEED-GD from the previous three algorithms because we cannot
directly use Theorem 2.4 due to the momentum. But the intuition and technique are very similar. Hence we
put it in the DEED series.


Remark 2: We noticed an independent work which also proved an accelerated rate, but which differs
from our work in the following aspects. First, their encoding scheme is a non-trivial combination of
Nesterov’s momentum and DIANA (which is why a separate paper is written), while our combination is
rather straightforward. Second, their bound has an extra constant dependent on the communication scheme
(which can be $N$ or the number of encoding bits), while our bound does not. Third, our work aims to develop a general
framework, and acceleration is just one case, while their work focuses on acceleration in the large-memory
frequent-communication setting.


5 Experiment 


Linear regression. We empirically validate our approach on a linear regression problem, as shown in Figure 1.
The solid lines correspond to gradient-descent-type algorithms and the dashed lines correspond to the
accelerated versions. In the left figure, the curves of our method coincide with the curves of the un-quantized
baseline methods (GD and A-GD). QSGD performs the worst since it uses constant error. In the right
figure, our accelerated method achieves a high-accuracy solution with the fewest bits, and another
state-of-the-art algorithm takes more than 6 times as many bits as ours to reach the same accuracy. Even without
acceleration, our algorithm (DEED-GD) takes fewer bits than A-DIANA. Overall, our algorithms save the
largest number of bits in communication without sacrificing convergence speed.


Image Classification. We also compare our algorithm with other state-of-the-art algorithms (e.g. QSGD,
TernGrad, DoubleSqueeze, DIANA) on image classification tasks on the MNIST data set [42]. The results
still show that our algorithms outperform the others both in terms of convergence speed and communication
complexity. The details of the two experiments are provided in the appendix.


6 Conclusion 


In this paper, we provide a general convergence analysis for inexact gradient descent algorithms using absolute
errors, tailored for quantized gradient methods. Using this general convergence analysis, we derive
a quantization scheme named DEED and propose algorithms for three common settings in distributed
optimization: frequent-communication large-memory, frequent-communication small-memory, and
infrequent-communication (both large-memory and small-memory included). We also combine DEED with Nesterov’s
acceleration to provide an accelerated algorithm, A-DEED-GD, for frequent-communication large-memory,
which improves the iteration complexity from $\tilde{O}(\kappa)$ to $\tilde{O}(\sqrt{\kappa})$. Our proposed algorithms converge almost as fast
as their non-quantized versions and save communication in terms of bits. We empirically test our algorithms
on linear regression problems and image classification tasks, and find that they use fewer bits than other
algorithms.
A Outline, Definition and Related Work


In the appendix, we first discuss some common definitions and related work. In Section B we provide the
proofs for the general convergence analysis in Section 2. In Section C we give the detailed description
of the three algorithms based on our framework DEED. In Section D we provide the proofs of the theorems
for DEED in the three settings. In Section F we illustrate the efficiency of our algorithms on a linear regression
problem and image classification tasks on the MNIST dataset.


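Definition A.1. A differentiable function $h : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if $\forall x, y \in \mathbb{R}^d$,

$$|h(y) - h(x) - \langle y - x, \nabla h(x) \rangle| \le \frac{L}{2} \|x - y\|^2. \qquad (4)$$

Definition A.2. A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly-convex if $\forall x, y \in \mathbb{R}^d$,

$$f(y) - f(x) - \langle y - x, \nabla f(x) \rangle \ge \frac{\mu}{2} \|x - y\|^2. \qquad (5)$$

Definition A.3. Suppose $f : \mathbb{R}^d \to \mathbb{R}$, $f(w) := \frac{1}{N} \sum_{i=1}^{N} f_i(w)$, where $f_i$ is $L$-smooth $\forall i \in [N]$ and $f$ is
$\mu$-strongly-convex. Then the condition number $\kappa$ of this collection of functions is defined as $\kappa = \frac{L}{\mu}$.


Assumption A.4. Assume that $f$ is $\mu$-strongly-convex and $f_i$ is $L$-smooth $\forall i \in [N]$.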


Related work. The study of the communication complexity in terms of bits for convex minimization of
this problem can be traced back to a classical work in 1987. That work focuses on the two-node case
in the frequent-communication large-memory setting, and proposes a nearly optimal algorithm using quantized
gradient differences. For the multiple-node case, prior work provides a linear convergence rate, also using gradient
differences on the computing nodes. In the frequent-communication small-memory setting, to save more bits in
communication, other works consider double encoding on gradient differences. In the infrequent-communication
setting, a further work uses quantized gradient differences and proves a sublinear convergence rate. Our work is different,
as we provide convergence analysis in three settings for multiple-node cases and use double encoding to
save more bits.


B Proof of Theorems for General Convergence Analysis 


Lemma B.1. Given a set $S = \{w \in \mathbb{R}^d \mid \|w\|_2 \le M\}$, any (random) quantization algorithm that encodes a
vector in $S$ with absolute error $\sigma$ takes at least $\lceil d \log_2 \frac{1}{\epsilon} \rceil$ bits (in expectation), where $\epsilon = \frac{\sigma}{M}$. In
addition, there exists a (random) algorithm that takes only $\lceil 1.05d + d \log_2 \frac{1+2\epsilon}{\epsilon} \rceil$ bits (in expectation).


Proof sketch. For the lower bound, we only need to prove the deterministic version, since every random
algorithm can be reduced to a deterministic algorithm by fixing $\xi$. The claim is then equivalent to covering $S$ with small
balls, and the proof follows. We use a constructive method to prove the upper bound.


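Proof. We first prove the lower bound. For every $m \in E(S)$, construct a ball centered at $D(m)$ with radius $\sigma$. Then
all these balls form a cover of $S$. Otherwise there is a vector $v \in S$ outside the cover, and

$$\|v - D(E(v))\|_2 \ge \min_{m \in E(S)} \|v - D(m)\|_2 > \sigma,$$

which contradicts the assumption.
Hence, the sum of the volumes of these small balls is not less than the volume of $S$, and $|E(S)| \ge \left(\frac{M}{\sigma}\right)^d = \epsilon^{-d}$
follows.


On the other hand, we divide the whole space regularly into cubes with side length $\frac{2\sigma}{\sqrt{d}}$. Then we select the
cubes that have non-empty intersection with $S$. Every point in $S$ is contained in such a cube, which means it
can be encoded (unbiasedly) by the vertices of the cube with maximal error $\sigma$. These cubes must be
contained in the ball $B(0, M + 2\sigma)$ since the diameter of each cube is $2\sigma$. In this case, the number of cubes is at
most $\frac{\pi^{d/2} (M + 2\sigma)^d}{\Gamma(\frac{d}{2} + 1)} \Big/ \left(\frac{2\sigma}{\sqrt{d}}\right)^d$, and then the number of bits is at most
$\left\lceil \log_2 \left( \frac{\pi^{d/2}}{\Gamma(\frac{d}{2} + 1)} \left(\frac{(M + 2\sigma)\sqrt{d}}{2\sigma}\right)^d \right) \right\rceil$.


Recalling Stirling’s formula $\Gamma(n + 1) \ge \sqrt{2\pi n} \left(\frac{n}{e}\right)^n$, we have

$$\log_2 \left( \frac{\pi^{d/2}}{\Gamma(\frac{d}{2} + 1)} \left(\frac{(M + 2\sigma)\sqrt{d}}{2\sigma}\right)^d \right) \le \log_2 \left( \frac{1}{\sqrt{\pi d}} \left(\frac{2\pi e}{d}\right)^{d/2} \left(\frac{(M + 2\sigma)\sqrt{d}}{2\sigma}\right)^d \right)$$

$$\le \frac{d}{2} \log_2 \frac{\pi e}{2} + d \log_2 \frac{M + 2\sigma}{\sigma} \le 1.05d + d \log_2 \frac{1 + 2\epsilon}{\epsilon}.$$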


Theorem B.2. Suppose Assumption 2.3 holds and $\{w_t\}$ is a sequence generated by

$$w_{t+1} = F_t(w_t) + e_t, \qquad (6)$$

for some chosen initial value $w_0$, where $e_t$ is a zero-mean random noise depending on the (iteration) history
and bounded by $\alpha_t$. Define the series $C_k^2 = \sum_{i=0}^{k-1} \alpha_i^2 \prod_{j=i+1}^{k-1} c_j^2$ and $D_k^2 = \prod_{i=0}^{k-1} c_i^2$. Then we have

$$\mathbb{E}\left[\|w_T - w^*\|^2\right] \le D_T^2 \|w_0 - w^*\|^2 + C_T^2. \qquad (7)$$

In addition, there exist a function series $\{F_t\}_{t \ge 0}$ and noise $\{e_t\}_{t \ge 0}$ for which (7) holds with equality. Besides, if
we suppose the sequence of the Lipschitz constants $\{c_t\}$ is non-decreasing, then the right-hand side of (7)
converges linearly if and only if all $c_t$’s are bounded above by a constant $c < 1$ and $\alpha_T$ converges to 0
linearly.


Proof. The inequality is straightforward.

$$\begin{aligned}
E\left[\|w_{t+1} - w^*\|^2 \mid w_t\right] &= E\left[\|F_t(w_t) + e_t - w^*\|^2 \mid w_t\right] \\
&\le \|F_t(w_t) - w^*\|^2 + a_t^2 \\
&\le c_t^2 \|w_t - w^*\|^2 + a_t^2.
\end{aligned}$$

Then we can prove inequality (7) by mathematical induction. For $T = 0$, we have $\|w_0 - w^*\|^2 = \|w_0 - w^*\|^2$. Suppose it holds for $T \le k$; then we have

$$\begin{aligned}
E\|w_{k+1} - w^*\|^2 &\le c_k^2 E\|w_k - w^*\|^2 + a_k^2 \\
&\le c_k^2 \left(D_k^2 \|w_0 - w^*\|^2 + C_k^2\right) + a_k^2 \\
&= D_{k+1}^2 \|w_0 - w^*\|^2 + C_{k+1}^2.
\end{aligned}$$

This completes the induction.


Next, we define $F_t(x) := c_t x$. Given the history $\{w_0, w_1, \dots, w_t\}$, we choose $\tilde{e}_t$ as an arbitrary vector orthogonal to $F_t(w_t) - w^*$ with length $a_t$. Define $e_t = \tilde{e}_t$ with probability $\frac{1}{2}$ and $e_t = -\tilde{e}_t$ otherwise. In this case, we always have

$$\|w_{t+1} - w^*\|^2 = c_t^2 \|w_t - w^*\|^2 + a_t^2,$$

and

$$\|w_T - w^*\|^2 = D_T^2 \|w_0 - w^*\|^2 + C_T^2$$

follows.
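The tightness construction can be checked numerically. The following is a minimal sketch under the assumption $w^* = 0$ (the fixed point of $F_t(x) = c_t x$); the helper names `closed_form` and `simulate_tight` are hypothetical and only illustrate that the recursion reproduces $D_T^2 \|w_0\|^2 + C_T^2$.

```python
import numpy as np


def closed_form(c, a, r0_sq):
    """Right-hand side of (7): D_T^2 * ||w_0 - w*||^2 + C_T^2."""
    c = np.asarray(c, dtype=float)
    a = np.asarray(a, dtype=float)
    T = len(c)
    D2 = np.prod(c ** 2)
    # C_T^2 = sum_i a_i^2 * prod_{j > i} c_j^2 (an empty product equals 1).
    C2 = sum(a[i] ** 2 * np.prod(c[i + 1:] ** 2) for i in range(T))
    return D2 * r0_sq + C2


def simulate_tight(c, a, w0, rng):
    """Run w_{t+1} = c_t w_t + e_t with e_t orthogonal to c_t w_t, |e_t| = a_t."""
    w = np.asarray(w0, dtype=float).copy()
    for ct, at in zip(c, a):
        fw = ct * w                                # F_t(w_t), fixed point w* = 0
        z = rng.standard_normal(w.shape)
        z -= (z @ fw) / (fw @ fw) * fw             # remove the component along fw
        e = at * z / np.linalg.norm(z)
        if rng.random() < 0.5:                     # random sign keeps e_t zero-mean
            e = -e
        w = fw + e
    return float(w @ w)
```

Because the noise is orthogonal to $F_t(w_t) - w^*$ at every step, the squared distance matches the closed form on every sample path, not only in expectation.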


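Finally, suppose the sequence of the Lipschitz constants $\{c_t\}$ is non-decreasing.

• Necessity. Suppose there exist constants $C, M > 0$ and $c \in (0,1)$ such that $D_T^2 \|w_0 - w^*\|^2 + C_T^2 \le C c^T$, $\forall T > M$.

We first prove a lemma.

Lemma B.3. The function $g(x) := \frac{1-x}{\ln \frac{1}{x}}$ is increasing on $(0,1)$.

Proof. $\forall x \in (0,1)$, we have $g'(x) = \frac{\frac{1}{x} - 1 - \ln \frac{1}{x}}{\ln^2 \frac{1}{x}} > 0$.

Recall $c_t \ge c_0$, $\forall t \ge 0$; we have $\frac{1-c_t}{\ln \frac{1}{c_t}} \ge \frac{1-c_0}{\ln \frac{1}{c_0}}$, i.e. $\ln \frac{1}{c_t} \le C_0 (1 - c_t)$ where $C_0 := \frac{\ln \frac{1}{c_0}}{1 - c_0}$. Because $D_T^2 \le \frac{C c^T}{\|w_0 - w^*\|^2}$, we have $2 \sum_{i=0}^{T-1} \ln \frac{1}{c_i} \ge \ln \frac{\|w_0 - w^*\|^2}{C} + T \ln \frac{1}{c}$. Moreover, $2 C_0 \sum_{i=0}^{T-1} (1 - c_i) \ge \ln \frac{\|w_0 - w^*\|^2}{C} + T \ln \frac{1}{c}$. This suggests $\forall k \ge 0$, $\lim_{T \to \infty} \frac{\sum_{i=k}^{T-1} (1 - c_i)}{T} \ge \frac{\ln \frac{1}{c}}{2 C_0}$, and $1 - c_k \ge \frac{\ln \frac{1}{c}}{2 C_0}$ follows. Hence, $c_t$ is bounded by $\bar{c} := 1 - \frac{\ln \frac{1}{c}}{2 C_0}$. On the other hand, $C c^T \ge C_T^2 \ge a_{T-1}^2$, and $a_t$ diminishes exponentially.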


• Sufficiency. Suppose $c_i \le c < 1$, $\forall i \ge 0$ and $a_t \le C \alpha^t$, $\forall t \ge M$ for constants $C > 0$, $M > 0$ and $\alpha \in (0,1)$. Without loss of generality, we assume $M = 0$. We only need to prove that $D_k$ and $C_k$ diminish exponentially separately. It is trivial for $D_k$. For $C_k$, we have $C_k^2 = \sum_{i=0}^{k-1} a_i^2 \prod_{j=i+1}^{k-1} c_j^2 \le C k \beta^k$ where $\beta = \max\{c, \alpha\}$. Then $C_k$ diminishes exponentially since $C k \beta^k \le \beta^{k/2}$ for sufficiently large $k$.


C Algorithms


C.1 DEED-GD


The pseudo-code of DEED-GD is given in Algorithm 1.
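To make the double-encoding structure concrete, here is a minimal single-process sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the names `quantize` and `deed_gd` are hypothetical, the quantizer $Q$ is assumed to be unbiased stochastic rounding onto a grid of cell size `max_err / sqrt(d)`, and each quantization uses the error level $s c'^{k+1}/2$ stated in Theorem D.1. The loop mirrors the worker and server steps of the DEED pseudo-code with exact gradients.

```python
import numpy as np


def quantize(v, max_err, rng):
    """Unbiased stochastic rounding with ||Q(v) - v||_2 <= max_err (assumed Q)."""
    step = max_err / np.sqrt(v.size)           # per-coordinate error is at most step
    scaled = v / step
    low = np.floor(scaled)
    round_up = rng.random(v.size) < (scaled - low)
    return (low + round_up) * step


def deed_gd(grads, w0, eta, s, c_prime, iters, rng):
    """Sketch of DEED-GD: workers and server both send quantized differences."""
    N = len(grads)
    w = np.asarray(w0, dtype=float).copy()
    s_loc = [np.zeros_like(w) for _ in range(N)]   # s^i_{k-1} kept by worker i
    s_srv = np.zeros_like(w)                       # s_{k-1} kept by server 0
    v = np.zeros_like(w)                           # v_{k-1} shared by all nodes
    history = []
    for k in range(iters):
        err = s * c_prime ** (k + 1) / 2
        d_sum = np.zeros_like(w)
        for i in range(N):
            d_i = quantize(grads[i](w) - s_loc[i], err, rng)  # first encoding
            s_loc[i] = s_loc[i] + d_i
            d_sum += d_i
        s_srv = s_srv + d_sum / N
        u = quantize(s_srv - v, err, rng)                     # second encoding
        v = v + u
        g_mean = sum(g(w) for g in grads) / N
        # Record the effective gradient error and its bound s * c'^(k+1).
        history.append((np.linalg.norm(v - g_mean), 2 * err))
        w = w - eta * v
    return w, history
```

Only the quantized differences `d_i` and `u` would be transmitted in a real deployment; the running sums `s_loc`, `s_srv` and `v` are reconstructed locally on each side.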
C.2 DEED-SGD


The algorithm of DEED-SGD is given below (Algorithm 2). As we promised previously, there are only two differences between DEED-GD and DEED-SGD.


• In line 5, we use the approximate gradient $\tilde{\nabla}_i$ instead of the gradient $\nabla$.


• In line 2, line 6 and line 11, we use different $\eta$, $c$ and $c'$, and the maximal error in quantization is changed.


Algorithm 2: Double Encoding and Error Diminishing for Stochastic Gradient Descent (DEED-SGD)
1: Initialization: Each server $i \in [N]$ holds $w_0 = s_{-1}^i = v_{-1} = 0$, server 0 holds $v_{-1} = 0$, $k = 0$;
2: $\eta = \frac{1}{\rho L}$, $c = 1 - \eta\mu$, $c < c' < 1$ and $s \in \mathbb{R}_+$ is the quantization level;
3: while the precision is not enough do
4:   for $i \in [N]$ do
5:     server $i$ computes $g_k^i = \tilde{\nabla}_i f_i(w_k)$;
6:     server $i$ does quantization $d_k^i = Q(g_k^i - s_{k-1}^i, \sqrt{s c'^{k+1}}/2)$;
7:     server $i$ updates $s_k^i = d_k^i + s_{k-1}^i$;
8:     server $i$ sends $d_k^i$ to server 0;
9:   end for
10:  server 0 computes $s_k = \frac{1}{N} \sum_{i=1}^N d_k^i + s_{k-1}$;
11:  server 0 does quantization $u_k = Q(s_k - v_{k-1}, \sqrt{s c'^{k+1}}/2)$;
12:  server 0 sends $u_k$ to server $i$, $\forall i \in [N]$;
13:  server 0 updates $v_k = u_k + v_{k-1}$;
14:  for $i \in [N]$ do
15:    server $i$ updates $v_k = u_k + v_{k-1}$;
16:    server $i$ updates $w_{k+1} = w_k - \eta v_k$;
17:  end for
18:  $k = k + 1$;
19: end while


C.3 DEED-Fed 


We first introduce the original FedAvg algorithm. Define $F(w) := \sum_{i=1}^N p_i F_i(w)$ where the $F_i$'s are $\mu$-convex and $L$-smooth functions defined on $\mathbb{R}^d$ and $p_i \ge 0$, $\sum_{i=1}^N p_i = 1$. In each round, say round $t \ge 0$, the center server sends the weight $w_{tE}$ to the $N$ slave nodes, and the $k$-th slave node performs $E$ local updates (for $i$ in $\{0, 1, \dots, E-1\}$):

$$w_{tE+i+1}^k = w_{tE+i}^k - \eta_{tE+i} \nabla F_k(w_{tE+i}^k, \xi_{tE+i}^k).$$

Finally, in the full participant setting, all slave nodes send their final weights to the center server, and the center server computes

$$w_{(t+1)E} := \sum_{i=1}^N p_i w_{(t+1)E}^i.$$

For the partial participant setting, not all slave nodes stay active in each round. Here are two more detailed settings.


1) In setting 1, we define a set $S_{t+1}$ of $K$ indices selected randomly with replacement from $[N]$ with probability distribution $(p_1, \dots, p_N)$. Then the center server updates

$$w_{(t+1)E} := \frac{1}{K} \sum_{i \in S_{t+1}} w_{(t+1)E}^i. \quad (8)$$


2) In setting 2, we define a set $S_{t+1}$ of $K$ indices selected evenly and randomly without replacement from $[N]$. Then the center server updates

$$w_{(t+1)E} := \frac{N}{K} \sum_{i \in S_{t+1}} p_i w_{(t+1)E}^i. \quad (9)$$
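The three aggregation rules (full participation, (8) and (9)) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, `weights` is assumed to be an array of shape `(N, d)` holding the final local weights $w_{(t+1)E}^i$, and sampling uses NumPy's `Generator.choice`.

```python
import numpy as np


def aggregate_full(weights, p):
    """Full participation: sum_i p_i * w^i."""
    return p @ weights


def aggregate_scheme1(weights, p, K, rng):
    """Setting 1, eq. (8): sample K indices with replacement according to p."""
    idx = rng.choice(len(p), size=K, replace=True, p=p)
    return weights[idx].mean(axis=0)


def aggregate_scheme2(weights, p, K, rng):
    """Setting 2, eq. (9): sample K indices uniformly without replacement."""
    N = len(p)
    idx = rng.choice(N, size=K, replace=False)
    return (N / K) * (p[idx] @ weights[idx])
```

With $K = N$, setting 2 selects every node exactly once and reduces to the full participant average.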


Besides the assumptions of smoothness and convexity, there are two more assumptions.


Assumption C.1. Let $\xi_t^k$ be sampled from the $k$-th device's local data uniformly at random. The variance of stochastic gradients in each device is bounded: $E\|\nabla F_k(w_t^k, \xi_t^k) - \nabla F_k(w_t^k)\|^2 \le \sigma_k^2$.


Assumption C.2. The expected squared norm of stochastic gradients is uniformly bounded, i.e. $E\|\nabla F_k(w_t^k, \xi_t^k)\|^2 \le G^2$.


Based on these two assumptions, their theorem states the following.


Theorem C.3. With the algorithm above, we have

$$E\|\bar{w}_{t+1} - w^*\|^2 \le (1 - \eta_t \mu) E\|\bar{w}_t - w^*\|^2 + \eta_t^2 (B + C),$$

where $B = \sum_{k=1}^N p_k^2 \sigma_k^2 + 6L\Gamma + 8(E-1)^2 G^2$, and $C$ is a constant depending on the setting. In the full participant setting, $C = 0$. In partial participant setting 1, $C = \frac{4}{K} E^2 G^2$, and in partial participant setting 2, $C = \frac{N-K}{N-1} \cdot \frac{4}{K} E^2 G^2$.


The algorithm DEED-Fed is a simple combination of the DEED-GD and Federated Averaging algorithms. Algorithm 3 is the pseudo-code for the full participant setting.


To change Algorithm 3 into the partial participant versions, we only need to replace $[N]$ by the set $S$ in line 9 and change the summation in line 14 into (8) or (9) correspondingly.


D Theorems for DEED in Three Settings 


In this section, we restate Theorem 3.2 separately for each of the three settings and prove each case.
Algorithm 3: Double Encoding and Error Diminishing Federated Averaging (DEED-Fed)
1: Initialization: Each server $i \in [N]$ holds $w_0 = s_{-E}^i = v_{-E} = 0$, server 0 holds $v_{-E} = 0$, $k = 0$;
2: Hyper-parameters: $\eta_k$, $E \in \mathbb{N}_+$, parameter $s \in \mathbb{R}_+$;
3: while the precision is not enough do
4:   for $i \in [N]$ do
5:     $w_{k+1}^i = w_k^i - \eta_k \nabla f_i(w_k^i, \xi_k^i)$;
6:   end for
7:   $k = k + 1$;
8:   if $E \mid k$ then
9:     for $i \in [N]$ do
10:      server $i$ does quantization $d_k^i = Q(w_k^i - s_{k-E}^i, \frac{s \eta_k}{2})$;
11:      server $i$ updates $s_k^i = d_k^i + s_{k-E}^i$;
12:      server $i$ sends $d_k^i$ to server 0;
13:    end for
14:    server 0 computes $s_k = \sum_{i=1}^N p_i d_k^i + s_{k-E}$;
15:    server 0 does quantization $u_k = Q(s_k - v_{k-E}, \frac{s \eta_k}{2})$;
16:    server 0 sends $u_k$ to server $i$, $\forall i \in [N]$;
17:    server 0 updates $v_k = u_k + v_{k-E}$;
18:    for $i \in [N]$ do
19:      server $i$ updates $v_k = u_k + v_{k-E}$;
20:      server $i$ updates $w_k^i = v_k$;
21:    end for
22:  end if
23: end while


D.1 Proof for DEED-GD 


Theorem D.1. In Algorithm 1, we choose the learning rate $\eta = \frac{2}{L+\mu}$, $c := 1 - \eta\mu$, $c < c' < 1$, and the maximal error at iteration $t$ is $s c'^{t+1}/2$ where $s$ is the quantization level. Then Algorithm 1 communicates $O(Nd)$ bits at iteration $t \ge 1$, and

$$\|w_t - w^*\| \le (c')^t \left( \max\left\{0, \|w_0 - w^*\| - \frac{c' \eta s}{c' - c}\right\} + \frac{c' \eta s}{c' - c} \right). \quad (10)$$


Proof sketch. First of all, by mathematical induction and the triangle inequality, we see that the effective error at iteration $t$ is bounded by $\eta \cdot s c'^{t+1}$. Besides, consider the function $F_t(w) = F(w) := w - \eta \nabla f(w)$. Because $f$ is $L$-smooth and $\mu$-convex, we know that $F$ is $(1 - \eta\mu)$-Lipschitz with fixed point $w^*$. Then, according to our framework, we can prove that $\|w_t - w^*\|$ converges at speed $c'$. This is exactly (10).


To bound the number of bits produced by each communication, we only need to prove that $\|g_t^i - s_{t-1}^i\| = O(c'^t)$ and $\|s_t - v_{t-1}\| = O(c'^t)$, $\forall i \in [N]$, because of Lemma 2.2. Notice that these two norms are both close to $L\|w_t - w_{t-1}\|$, and $\|w_t - w_{t-1}\|$ can be bounded by $\|w_t - w^*\| + \|w^* - w_{t-1}\|$, which are also $O(c'^t)$ by (10), and everything is done. In our real proof, we will bound these terms carefully to give a tighter bound.


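Proof. First of all, we can give a deterministic version of Theorem 2.4, as we promised below Theorem 2.4.

Proposition D.2. Suppose $F : \mathbb{R}^d \to \mathbb{R}^d$ is a continuous function with Lipschitz constant $c < 1$. Let $\{e_t\}_{t \ge 1}$ be a sequence in $\mathbb{R}^d$ satisfying $\|e_t\| \le \eta s c'^t$ where $c < c' < 1$. Then $\forall x_0 \in \mathbb{R}^d$, the sequence constructed by $x_{t+1} = F(x_t) + e_{t+1}$ satisfies $D(x_k) \le c'^k \left( \max\left\{0, D(x_0) - \frac{c' \eta s}{c' - c}\right\} + \frac{c' \eta s}{c' - c} \right)$ where $D(w) := \|w - w^*\|$ and $w^*$ is the fixed point of $F$.


Proof. By definition, we have the following inequalities.

$$\begin{aligned}
D(x_{k+1}) &= D(F(x_k) + e_{k+1}) \\
&\le D(F(x_k)) + \|e_{k+1}\| \\
&\le c \cdot D(x_k) + \eta s c'^{k+1}.
\end{aligned}$$

Hence,

$$\frac{D(x_k)}{c^k} \le \frac{D(x_{k-1})}{c^{k-1}} + \eta s \left(\frac{c'}{c}\right)^k.$$

Finally, we have $D(x_k) \le c^k \left( D(x_0) - \frac{c' \eta s}{c' - c} \right) + c'^k \left( \frac{c' \eta s}{c' - c} \right)$, which is bounded by $c'^k \left( \max\left\{0, D(x_0) - \frac{c' \eta s}{c' - c}\right\} + \frac{c' \eta s}{c' - c} \right)$.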
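The bound of Proposition D.2 can be checked against the worst case of the recursion $D(x_{k+1}) \le c \cdot D(x_k) + \eta s c'^{k+1}$. The sketch below is only a numerical sanity check with hypothetical function names and arbitrary parameter values; it iterates the recursion with equality and compares it with the closed-form bound.

```python
def worst_case_distances(D0, c, c_prime, eta, s, K):
    """Iterate D_{k+1} = c * D_k + eta * s * c'^(k+1), the worst case allowed."""
    D = [D0]
    for k in range(K):
        D.append(c * D[-1] + eta * s * c_prime ** (k + 1))
    return D


def prop_d2_bound(D0, c, c_prime, eta, s, k):
    """c'^k * (max{0, D0 - r} + r) with r = c' * eta * s / (c' - c)."""
    r = c_prime * eta * s / (c_prime - c)
    return c_prime ** k * (max(0.0, D0 - r) + r)


if __name__ == "__main__":
    D = worst_case_distances(5.0, 0.5, 0.8, 0.1, 1.0, 30)
    print(D[30], prop_d2_bound(5.0, 0.5, 0.8, 0.1, 1.0, 30))
```

The bound is loose by the factor $(c'/c)^k$ on the initial-distance term, which is the price of quoting a single rate $c'$ for both terms.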


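Then we can prove the inequality (10). Notice $\nabla F = I - \eta \nabla^2 f$ where $\mu I \preceq \nabla^2 f \preceq LI$; we see $F$ is $c := 1 - \eta\mu$ Lipschitz. By Proposition D.2, we only need to prove $\|v_k - g_k\| \le s c'^{k+1}$ where $g_k = \nabla f(x_k)$. By induction, we have

$$s_k - \frac{1}{N} \sum_{i=1}^N s_k^i = \frac{1}{N} \sum_{i=1}^N d_k^i + s_{k-1} - \frac{1}{N} \sum_{i=1}^N (d_k^i + s_{k-1}^i) = s_{k-1} - \frac{1}{N} \sum_{i=1}^N s_{k-1}^i = \dots = s_{-1} - \frac{1}{N} \sum_{i=1}^N s_{-1}^i = 0.$$

Then,

$$\begin{aligned}
\|v_k - g_k\| &= \|u_k + v_{k-1} - g_k\| \\
&\le \|u_k - (s_k - v_{k-1})\| + \|s_k - g_k\| \\
&\le \|u_k - (s_k - v_{k-1})\| + \left\| \frac{1}{N} \sum_{i=1}^N (d_k^i + s_{k-1}^i - g_k^i) \right\| \\
&\le \|u_k - (s_k - v_{k-1})\| + \frac{1}{N} \sum_{i=1}^N \|d_k^i + s_{k-1}^i - g_k^i\| \\
&\le s c'^{k+1}.
\end{aligned}$$

For convenience, we define the constant $\xi_s := \max\left\{0, \|x_0 - x^*\| - \frac{c' \eta s}{c' - c}\right\} + \frac{c' \eta s}{c' - c}$. The convergence result comes directly from Proposition D.2.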
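To bound the number of bits, we only need to calculate the maximal norm of the vectors we need to encode. Actually, $\forall i \in [N]$, $\forall k \ge 1$ we have

$$\begin{aligned}
\|g_k^i - s_{k-1}^i\| &\le \|g_k^i - g_{k-1}^i\| + \|g_{k-1}^i - s_{k-1}^i\| \\
&\le L\|x_k - x_{k-1}\| + \frac{s c'^k}{2} \\
&\le L\eta \|v_{k-1}\| + \frac{s c'^k}{2} \\
&\le L\eta \|g_{k-1}\| + L\eta s c'^k + \frac{s c'^k}{2} \\
&\le L^2 \eta \|x_{k-1} - x^*\| + L\eta s c'^k + \frac{s c'^k}{2} \\
&\le L^2 \eta \xi_s c'^{k-1} + L\eta s c'^k + \frac{s c'^k}{2} \\
&= \frac{2L^2 \eta \xi_s + 2L\eta s c' + s c'}{2c'^2} c'^{k+1}.
\end{aligned}$$

Similarly, we have

$$\begin{aligned}
\|s_k - v_{k-1}\| &\le \|s_k - g_k\| + \|g_k - g_{k-1}\| + \|g_{k-1} - v_{k-1}\| \\
&\le L\|x_k - x_{k-1}\| + \frac{3 s c'^k}{2} \\
&\le \left( L^2 \eta \xi_s + L\eta s c' + \frac{3 s c'}{2} \right) c'^{k-1} \\
&= \frac{2L^2 \eta \xi_s + 2L\eta s c' + 3 s c'}{2c'^2} c'^{k+1}.
\end{aligned}$$

In this case, the error fraction (length divided by error, i.e. the inverse of the relative error) is bounded by $\zeta_s := \frac{2L^2 \eta \xi_s + 2L\eta s c' + 3 s c'}{s c'^2}$, and the number of bits is at most $(1.05 + \log_2(\zeta_s + 2)) d$ by Lemma B.1.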


D.2 Proof for DEED-SGD 


Theorem D.3. In Algorithm 2, we assume the Weak Growth Condition (WGC) is satisfied for the approximate gradient $\tilde{\nabla}_i$ for every $f_i$ with parameter $\rho$. We choose the learning rate $\eta = \frac{1}{\rho L}$, $c := 1 - \eta\mu$, $c < c' < 1$, and the maximal error at iteration $t$ is $\sqrt{s c'^{t+1}}/2$ and the error is unbiased. Then DEED-SGD communicates $O(Nd)$ bits at iteration $t \ge 1$, and

$$E\|w_t - w^*\|^2 \le (c')^t \left( \max\left\{0, \|w_0 - w^*\|^2 - \frac{c' \eta^2 s}{c' - c}\right\} + \frac{c' \eta^2 s}{c' - c} \right). \quad (11)$$


Proof sketch. First of all, we introduce theorem 5 in to show that with WGC, stochastic gradient 
descent converges linearly. 


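Lemma D.4. Suppose $f : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly-convex. Besides, $f(x) = \frac{1}{N} \sum_{i=1}^N f_i(x)$, where each $f_i$ is $L$-smooth. We assume WGC is satisfied for the approximate gradient $\tilde{\nabla}_i$ with parameter $\rho$. Then the series $\{x_t\}_{t \ge 0}$ generated by the iteration formula

$$x_{k+1} = x_k - \frac{\eta}{N} \sum_{i=1}^N \tilde{\nabla}_i f_i(x_k) \quad (12)$$

satisfies

$$E\left[\|x_{k+1} - x^*\|^2 \mid x_k\right] \le \left(1 - \frac{\mu}{\rho L}\right) \|x_k - x^*\|^2.$$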
Similarly to the proof in the previous section, we can bound the effective error by $\eta \sqrt{s c'^{k+1}}$. Then we have

$$E\|x_{k+1} - x^*\|^2 \le c E\|x_k - x^*\|^2 + \eta^2 s c'^{k+1}.$$

Then we can put it in our framework. By using the same technique as in Proposition D.2, we have inequality (11). Finally, we can bound the number of bits by using Lemma 2.2.


Proof. As we showed in the sketch, we have proved inequality (11). We only need to prove that $E\|g_k^i - s_{k-1}^i\|^2$ and $E\|s_k - v_{k-1}\|^2$ are $O(c'^k)$, $\forall i \in [N]$, because $\ln \sqrt{x} = \frac{1}{2} \ln x$ is a concave function. By the triangle inequality and the definition of $L$-smoothness, we only need to prove $E\|w_k - w_{k-1}\|^2 \le 2E\|w_k - w^*\|^2 + 2E\|w^* - w_{k-1}\|^2 = O(c'^k)$, which is obvious by inequality (11).


D.3 Proof for DEED-Fed 


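Theorem D.5. In Algorithm 3, we choose the learning rate $\eta_t := \frac{\beta}{\gamma + t}$ for some $\beta > \frac{1}{\mu}$, $\gamma > 1$ such that $\eta_0 \le \frac{1}{4L}$ and $\eta_t \le 2\eta_{t+E}$. Let the maximal error at iteration $t \in \{0, E, 2E, \dots\}$ be $s \eta_t$. Then DEED-Fed communicates $O(Nd)$ bits at iteration $t \in \{E, 2E, \dots\}$, and

$$E\|\bar{w}_t - w^*\|^2 \le \frac{v}{\gamma + t}, \quad (13)$$

where $v := \max\left\{ \frac{\beta^2 (B + C + s^2)}{\beta\mu - 1}, \gamma \|w_0 - w^*\|^2 \right\}$. Here, $B = \sum_{k=1}^N p_k^2 \sigma_k^2 + 6L\Gamma + 8(E-1)^2 G^2$, and $C$ is a constant depending on the setting. In the full participant setting, $C = 0$. In partial participant setting 1, $C = \frac{4}{K} E^2 G^2$, and in partial participant setting 2, $C = \frac{N-K}{N-1} \cdot \frac{4}{K} E^2 G^2$.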


Proof sketch. Recall the theorem with no quantization in [41].


Theorem D.6. Assume Assumptions C.1 and C.2 hold. For FedAvg, we have

$$E\|\bar{w}_{t+1} - w^*\|^2 \le (1 - \eta_t \mu) E\|\bar{w}_t - w^*\|^2 + \eta_t^2 (B + C), \quad (14)$$

where $B = \sum_{k=1}^N p_k^2 \sigma_k^2 + 6L\Gamma + 8(E-1)^2 G^2$, and $C$ is a constant depending on the setting. In the full participant setting, $C = 0$. In partial participant setting 1, $C = \frac{4}{K} E^2 G^2$, and in partial participant setting 2, $C = \frac{N-K}{N-1} \cdot \frac{4}{K} E^2 G^2$.


Moreover, with inequality (14), we have $E\|\bar{w}_t - w^*\|^2 \le \frac{v}{\gamma + t}$ where $v := \max\left\{ \frac{\beta^2 (B + C)}{\beta\mu - 1}, \gamma \|w_0 - w^*\|^2 \right\}$.


Combining this theorem with the error analysis, we have$^5$

$$E\|\bar{w}_{t+1} - w^*\|^2 \le (1 - \eta_t \mu) E\|\bar{w}_t - w^*\|^2 + \eta_t^2 (B + C + s^2).$$

We can already put the map from $\bar{w}_t$ to $\bar{w}_{t+1}$ into our framework, and prove that it converges sublinearly. Actually, we can just use Theorem D.6 and conclude that inequality (13) holds. The only difficulty is the bound for communication. We cannot bound $\|\bar{w}_t - \bar{w}_{t+E}\|$ by $\|\bar{w}_t - w^*\| + \|w^* - \bar{w}_{t+E}\|$ since it is $O(1/\sqrt{t})$, while the precision is $O(1/t)$. Please see this part in the proof below.


$^5$ We can prove a better inequality since the quantization is only done at iterations $E, 2E, \dots$. However, this is enough since it does not change the convergence rate.
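Proof. We have proved inequality (13). The only thing left is to prove that $E\|w_{(t+1)E}^k - w_{tE}^k\|^2 = O(1/t^2)$.


Notice that for each larger iteration $t \ge 0$ and slave node $k$, we have

$$\begin{aligned}
\|w_{(t+1)E}^k - w_{tE}^k\| &\le \sum_{i=0}^{E-1} \|w_{tE+i+1}^k - w_{tE+i}^k\| \\
&\le \sum_{i=0}^{E-1} \eta_{tE+i} \|\nabla F_k(w_{tE+i}^k, \xi_{tE+i}^k)\| \\
&\le \eta_{tE} \sum_{i=0}^{E-1} \|\nabla F_k(w_{tE+i}^k, \xi_{tE+i}^k)\|.
\end{aligned}$$

Hence,

$$\begin{aligned}
E\|w_{(t+1)E}^k - w_{tE}^k\|^2 &\le \eta_{tE}^2 E \sum_{i=0}^{E-1} \left( E\|\nabla F_k(w_{tE+i}^k)\|^2 + \sigma_k^2 \right) \\
&\le \eta_{tE}^2 E \sum_{i=0}^{E-1} \left( L^2 E\|w_{tE+i}^k - w^*\|^2 + \sigma_k^2 \right) \\
&\le \eta_{tE}^2 E \sum_{i=0}^{E-1} \left( 2L^2 E\|w_{tE+i}^k - \bar{w}_{tE+i}\|^2 + 2L^2 E\|\bar{w}_{tE+i} - w^*\|^2 + \sigma_k^2 \right) \\
&\le \eta_{tE}^2 E \sum_{i=0}^{E-1} \left( 2L^2 \eta_{tE}^2 E^2 G^2 + 2L^2 \frac{v}{\gamma + tE + i} + \sigma_k^2 \right) \\
&\le \eta_{tE}^2 E^2 \left( 2L^2 \eta_{tE}^2 E^2 G^2 + 2L^2 \frac{v}{\gamma + tE} + \sigma_k^2 \right).
\end{aligned}$$

With the approximation above, the squared norms of all the vectors we need to quantize are bounded by

$$\eta_{tE}^2 E^2 \left( 2L^2 \eta_{tE}^2 E^2 G^2 + 2L^2 \frac{v}{\gamma + tE} + \sigma_k^2 \right)$$

in expectation, and the proof follows.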


E Algorithm A-DEED-GD and its Convergence Analysis 


A-DEED-GD is the accelerated version in the DEED series. The algorithm and its proof are similar to those of DEED-GD. The only difference from DEED-GD is the update rule. Please see Algorithm 4 below for details.

Proof sketch. First of all, we can use the triangle inequality to prove that the error on $v_k$ is small, i.e. $\|g_k - v_k\| \le s c'^{k+1}$. Then, with diminishing error, we are able to prove linear convergence. Finally, we use Lemma 2.2 to show that we communicate $O(d)$ bits per communication. The first and the third steps are exactly the same as for DEED-GD. Hence, we only need to prove the convergence part.


Proof. Suppose $f : \mathbb{R}^d \to \mathbb{R}$ is a $\mu$-convex and $L$-smooth function.


Choose an arbitrary point $x_0 = y_0 = v_0 \in \mathbb{R}^d$; we can define $\phi_0(x) := \phi_0^* + \frac{\mu}{2} \|x - v_0\|^2$ where $\phi_0^* := f(v_0)$.
Then by definition we know (x) < f(x) and ¢§ > f (Zo). 
Algorithm 4: A-DEED-GD
1: Initialization: Each server $i \in [N]$ holds $x_0 = s_{-1}^i = v_{-1} = 0$, server 0 holds $v_{-1} = 0$, $k = 0$;
2: Parameter setting: $\eta = \frac{1}{L}$, $\tau = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}$, $c = \sqrt{1 - \sqrt{\mu/L}}$;
3: while the precision is not enough do
4:   for $i \in [N]$ do
5:     server $i$ computes $g_k^i = \nabla f_i(y_k)$;
6:     server $i$ does quantization $d_k^i = Q(g_k^i - s_{k-1}^i, \frac{s c'^{k+1}}{2})$;
7:     server $i$ updates $s_k^i = d_k^i + s_{k-1}^i$;
8:     server $i$ sends $d_k^i$ to server 0;
9:   end for
10:  server 0 computes $s_k = \frac{1}{N} \sum_{i=1}^N d_k^i + s_{k-1}$;
11:  server 0 does quantization $u_k = Q(s_k - v_{k-1}, \frac{s c'^{k+1}}{2})$;
12:  server 0 sends $u_k$ to server $i$, $\forall i \in [N]$;
13:  server 0 updates $v_k = u_k + v_{k-1}$;
14:  for $i \in [N]$ do
15:    server $i$ updates $v_k = u_k + v_{k-1}$;
16:    server $i$ updates $x_{k+1} = y_k - \eta v_k$;
17:    server $i$ updates $y_{k+1} = x_{k+1} + \tau (x_{k+1} - x_k)$;
18:  end for
19:  $k = k + 1$;
20: end while


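Next, we inductively define the following quantities. Suppose $\ell : \mathbb{Z}_{\ge 0} \to \mathbb{R}_+$ is an arbitrary function.

$$\begin{aligned}
x_{k+1} &= y_k - \frac{1}{L} m_k, \\
\phi_{k+1}(x) &= (1 - \alpha) \phi_k(x) + \alpha \left[ f(y_k) + \langle m_k, x - y_k \rangle + \frac{\mu}{2} \|x - y_k\|^2 \right], \\
v_{k+1} &= \arg\min_{v \in \mathbb{R}^d} \phi_{k+1}(v), \\
y_{k+1} &= \frac{x_{k+1} + \alpha v_{k+1}}{1 + \alpha}, \\
\phi_{k+1}^* &= \min \phi_{k+1},
\end{aligned}$$

where $\alpha = \sqrt{\mu/L}$ and $m_k \in \mathbb{R}^d$ is such that $\|m_k - \nabla f(y_k)\| \le c^{k+1} \ell(k)$.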


Besides, we will construct monotonically increasing functions $h, g : \mathbb{Z}_{\ge 0} \to \mathbb{R}$ inductively such that

$$\begin{aligned}
f(x_k) &\le \phi_k^* + h(k) c^{2k}, \\
\phi_k(x^*) &\le (1 - c^{2k}) f^* + c^{2k} \phi_0(x^*) + g(k) c^{2k}.
\end{aligned}$$

Obviously, it is appropriate to set $g(0) = h(0) = 0$.


Before we go deeper, here is an important lemma. 


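Lemma E.1. With the definition above, we have

$$y_{k+1} = x_{k+1} + \tau (x_{k+1} - x_k),$$

where $\tau = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}$.


Proof. Because $\phi_k(x) = \phi_k^* + \frac{\mu}{2} \|x - v_k\|^2$, taking the derivative gives $\nabla \phi_{k+1}(x) = \mu (1 - \alpha)(x - v_k) + \alpha m_k + \alpha\mu (x - y_k)$.

And then we get $v_{k+1} = (1 - \alpha) v_k + \alpha y_k - \frac{\alpha}{\mu} m_k$.

Because $y_k = \frac{x_k + \alpha v_k}{1 + \alpha}$, we can substitute $v_k = \frac{1 + \alpha}{\alpha} y_k - \frac{1}{\alpha} x_k$ and get

$$\begin{aligned}
v_{k+1} &= \frac{1 - \alpha^2}{\alpha} y_k - \frac{1 - \alpha}{\alpha} x_k + \alpha y_k - \frac{\alpha}{\mu} m_k \\
&= \frac{1}{\alpha} \left( y_k - \frac{1}{L} m_k \right) - \frac{1 - \alpha}{\alpha} x_k \\
&= \frac{x_{k+1} - x_k}{\alpha} + x_k.
\end{aligned}$$

Hence

$$y_{k+1} = \frac{x_{k+1} + \alpha v_{k+1}}{1 + \alpha} = x_{k+1} + \tau (x_{k+1} - x_k).$$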
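Lemma E.1 can be verified numerically by running both forms of the update on the same inexact gradients. This is a minimal sketch with hypothetical function names; the vectors `ms` stand in for the inexact gradients $m_k$ and are arbitrary here, since the identity does not depend on how $m_k$ is produced.

```python
import numpy as np


def estimate_sequence_y(x0, ms, L, mu):
    """y_k from the estimate-sequence form (updates of x, v and y)."""
    alpha = np.sqrt(mu / L)
    x = v = y = np.asarray(x0, dtype=float)
    ys = [y]
    for m in ms:
        x_next = y - m / L
        v_next = (1 - alpha) * v + alpha * y - (alpha / mu) * m  # argmin of phi_{k+1}
        y = (x_next + alpha * v_next) / (1 + alpha)
        x, v = x_next, v_next
        ys.append(y)
    return np.array(ys)


def momentum_y(x0, ms, L, mu):
    """y_k from the momentum form y_{k+1} = x_{k+1} + tau * (x_{k+1} - x_k)."""
    tau = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    x = y = np.asarray(x0, dtype=float)
    ys = [y]
    for m in ms:
        x_next = y - m / L
        y = x_next + tau * (x_next - x)
        x = x_next
        ys.append(y)
    return np.array(ys)
```

The momentum form is what Algorithm 4 runs in practice, since it avoids storing the auxiliary sequence $v_k$ of the estimate sequence.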
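Then we have

$$\begin{aligned}
f(x_k) - f^* &\le \phi_k^* + h(k) \cdot c^{2k} - f^* \\
&\le c^{2k} \left( \phi_0(x^*) - f^* + g(k) + h(k) \right) \\
&= c^{2k} \left( A + g(k) + h(k) \right),
\end{aligned}$$

where $A := \phi_0(x^*) - f^* = f(v_0) - f^* + \frac{\mu}{2} \|x_0 - x^*\|^2$. Hence,

$$\|x_k - x^*\| \le \sqrt{\frac{2}{\mu} \left( f(x_k) - f^* \right)} \le c^k \sqrt{\frac{2}{\mu}} \cdot \sqrt{A + g(k) + h(k)}.$$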


Furthermore, according to lemma 3 and the monotony of g and h, we have 


lys — 2" | 


[Ze j VL + yE 
(a ae pula A) fe. A +g(k)+ h(k) 


VL+Ji cvL+yu 
ad JATT. 


IA 


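And then we can give the first upper bound

$$\begin{aligned}
\phi_{k+1}(x^*) &= (1 - \alpha) \phi_k(x^*) + \alpha \left[ f(y_k) + \langle m_k, x^* - y_k \rangle + \frac{\mu}{2} \|x^* - y_k\|^2 \right] \\
&\le (1 - \alpha) \phi_k(x^*) + \alpha \left[ f(y_k) + \langle \nabla f(y_k), x^* - y_k \rangle + \frac{\mu}{2} \|x^* - y_k\|^2 \right] + \alpha \ell(k) c^{k+1} \|x^* - y_k\| \\
&\le (1 - \alpha) \phi_k(x^*) + \alpha f^* + \alpha \ell(k) c^{k+1} \|x^* - y_k\| \\
&\le c^2 \left( (1 - c^{2k}) f^* + c^{2k} \phi_0(x^*) + g(k) c^{2k} \right) + (1 - c^2) f^* + \alpha \ell(k) c^{k+1} \|x^* - y_k\| \\
&\le (1 - c^{2k+2}) f^* + c^{2k+2} \phi_0(x^*) + g(k) \cdot c^{2k+2} + 3\ell(k) \cdot c^{2k+2} \sqrt{\frac{2}{\mu}} \cdot \sqrt{A + g(k) + h(k)},
\end{aligned}$$

which means we only need to make

$$g(k+1) \ge g(k) + 3\ell(k) \sqrt{\frac{2}{\mu}} \cdot \sqrt{A + g(k) + h(k)}. \quad (16)$$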


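To make an upper bound of $h(k+1)$, we notice that

$$\phi_{k+1}^* = \phi_{k+1}(v_{k+1}) = (1 - \alpha) \left( \phi_k^* + \frac{\mu}{2} \|v_{k+1} - v_k\|^2 \right) + \alpha f(y_k) + \alpha \langle m_k, v_{k+1} - y_k \rangle + \frac{\alpha\mu}{2} \|v_{k+1} - y_k\|^2.$$

Substituting $v_{k+1} - y_k = (1 - \alpha)(v_k - y_k) - \frac{\alpha}{\mu} m_k$, we have

$$\begin{aligned}
\phi_{k+1}^* &= (1 - \alpha) \phi_k^* + \alpha f(y_k) - \frac{1}{2L} \|m_k\|^2 + \alpha (1 - \alpha) \left( \frac{\mu}{2} \|y_k - v_k\|^2 + \langle m_k, v_k - y_k \rangle \right) \\
&\ge (1 - \alpha) \left( f(x_k) - h(k) c^{2k} \right) + \alpha f(y_k) - \frac{1}{2L} \|m_k\|^2 + \alpha (1 - \alpha) \left( \frac{\mu}{2} \|y_k - v_k\|^2 + \langle m_k, v_k - y_k \rangle \right).
\end{aligned}$$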

Because f (rx) > flyr) + (Vf (yk), tk — Yk) > Fyk) + (Mk, £k — Ye) — Lk) Ater- yell and |er— yell = 
Te. ea < Ê eee | |e eel) < + (1+ ONEN + g(k) + h(k). 


Hence, we can bound ¢;,, by 
o1 2 2k+2 
F(yk) 5p "xl h(k): c 
2 
—c?F+t. Uk) (1+ o) 7 A + g(k) + h(k). 


Recall that 


IA 


Fera) Flom) + sella? < F (me — Vlo) ma) 
we have


which means h(-) only need to satisfy 


h(k +1) > h(k) + ek) - E A +g(k)+ h(k) + 2 (17) 


Finally, we only need to make inequalities (16) and (17) the iteration formulas for the arrays $\{g(k)\}_{k \ge 0}$, $\{h(k)\}_{k \ge 0}$ with initialization $g(0) = h(0) = 0$, and we have


3 5 2 
H vit AP A+i(k) + £ eu (18) 


i(k+1)> 


where i(k) := g(k) + h(k), Vk > 0. Now, Ve < c < 1, define y = 7 and l(k) = sy**+1. We will prove that 
i(k) < Cy*, Vk > 0 for sufficient large C by mathematical induction. 


Obviously, it holds for index 0. Suppose it holds for index less or equal to k. Then we only need to prove 
Vit + ae +5} 
Cat > (c a =) a4 sqft! /N g C72, 


This can be derived by 


2 34/2 +5,/2 
C (7? 2+ vis vi syVA+C. 


ae 


Hence, if we define a, = ws +A, bs = ea) sy, making C > 6? +a, + B./as — A is enough. 


The convergence result follows by inequality (15). 


F Experiments 


F.1 Linear regression 


We illustrate the effectiveness of our proposed algorithms (DEED-GD and its accelerated version A-DEED-GD) in the frequent-communication large-memory setting on a linear regression problem with a 100 by 100 Gaussian generated matrix whose condition number equals 16. We focus on star networks with 10 computing nodes. We perform 800 epochs for the non-accelerated algorithms and 200 epochs for the accelerated ones. In each update, the stepsize chosen on each computing node is $\min_i 1/L_i$, where $L_i$ is the Lipschitz constant of the function corresponding to node $i$. As required by Theorem 3.4 in the QSGD paper [13], we choose another learning rate for QSGD. The experiments are run on a computer with a 2 GHz dual-core Intel Core i5 processor.
We choose the quantization level to be 10000 in QSGD, as smaller quantization levels lead to larger loss values. For DIANA, as suggested by [19], we quantize the weight of each layer separately, and we use either the full block size of the weights or a block size equal to 20. The quantization level in the DIANA algorithms is equal to the block size $d_i$ for vector $i$, and the parameter $\alpha$ in DIANA is chosen to be $\min_i 1/\sqrt{d_i}$. For DEED-GD, we choose $s = 0.01$ and $c = 0.95$, which is the convergence parameter of the baseline method (GD) using the stepsize $\min_i 1/L_i$. Here $s$ and $c$ are the parameters controlling the maximal error defined below. For A-DEED-GD, we choose $s = 0.1$ and $c = 0.82$, which is the convergence parameter of the accelerated baseline method (A-GD) using the stepsize $\min_i 1/L_i$.

In our algorithms, to do quantization with maximal error $e$ ($= s \cdot c^{k+1}$) at iteration $k$, we consider the following algorithm. For a vector $w$, we first compute the vector $\hat{w} := \frac{\sqrt{d}\, w}{e}$. Next, we encode each coordinate of $\hat{w}$, say $\hat{w}_i$, into $\lfloor \hat{w}_i \rfloor$ or $\lfloor \hat{w}_i \rfloor + 1$ unbiasedly. We call the new vector $\tilde{v}$. Finally, we compute the quantized vector $v := \frac{e\, \tilde{v}}{\sqrt{d}}$. It is obvious that with this quantization method the error is bounded by $e$.

To encode the integer vector $\tilde{v}$ into bits, we use the method introduced in QSGD. It relies on Elias encoding, a map from positive integers to binary strings. First of all, we use Elias encoding for the first non-zero element of $\tilde{v}$ (1 bit for the sign and the remaining bits for the absolute value) and its position. Then we iteratively use Elias encoding for the next non-zero element and its distance from the previous position. This works especially well when $\tilde{v}$ is sparse.
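The encoding step can be illustrated with a short sketch. The code below uses Elias gamma coding, the simplest member of the Elias family, as an assumption for illustration; the exact Elias variant and bit layout used in the experiments are not specified here, and the function names are hypothetical. Each non-zero entry is stored as the gap from the previous non-zero position, one sign bit, and the absolute value.

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer: (len - 1) zeros, then binary n."""
    if n < 1:
        raise ValueError("Elias gamma coding is defined for positive integers")
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b


def encode_sparse(v) -> str:
    """Encode an integer vector as (gap, sign bit, |value|) per non-zero entry."""
    out, prev = [], -1
    for pos, val in enumerate(v):
        if val == 0:
            continue
        out.append(elias_gamma(pos - prev))       # distance from previous non-zero
        out.append("1" if val < 0 else "0")       # sign bit
        out.append(elias_gamma(abs(val)))         # absolute value
        prev = pos
    return "".join(out)


def _read_gamma(bits: str, i: int):
    """Decode one gamma code starting at index i; return (value, next index)."""
    zeros = 0
    while bits[i] == "0":
        zeros += 1
        i += 1
    return int(bits[i:i + zeros + 1], 2), i + zeros + 1


def decode_sparse(bits: str, length: int):
    """Inverse of encode_sparse for a vector of known length."""
    v, i, pos = [0] * length, 0, -1
    while i < len(bits):
        gap, i = _read_gamma(bits, i)
        pos += gap
        sign = -1 if bits[i] == "1" else 1
        i += 1
        mag, i = _read_gamma(bits, i)
        v[pos] = sign * mag
    return v
```

Small magnitudes and short gaps get short codes, which is why the scheme benefits from sparse vectors whose non-zero entries are small integers.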


As mentioned in Table 1, there are two possible settings for algorithms that only quantize on the computing nodes, such as DIANA. In the star network setting, the center node transmits the full vector (with no quantization) to the computing nodes in the broadcasting step. This costs an extra $32dN$ bits. In the fully connected network setting, computing nodes broadcast to each other and update their information separately. This leads to an extra factor of $N - 1$ on the total number of bits, since each computing node has to broadcast to $N - 1$ other nodes. Another method is to multiply the total number of bits sent from the computing nodes to the center node by 2 instead of $N - 1$ when comparing with the DEED series. This is reasonable for the following reasons. 1) In the DEED series, each computing node has two communications in each iteration: sending a message and receiving a message. So, in this case what we really compare is proportional to number of iterations $\times$ number of bits per communication. 2) The "$\times 2$" scheme counts fewer bits than both the star network setting and the fully connected network setting. If we can beat DIANA and other algorithms in this setting, it means our framework is essentially better than theirs, i.e. better in any setting. Notice that this is only a method of counting bits, not a method of communication.


The performance analysis is provided in Section 5, and it shows that our algorithms (DEED-GD and its accelerated version A-DEED-GD) save the largest number of bits in communication without sacrificing convergence speed. We also run the experiments with 5 different random seeds, and the results are shown in Figure 2. In Figure 2, the shaded regions span the maximal and minimal loss values at each epoch among the 5 different runs; the regions are too small to be visible because the variance is small.


F.2 Image classification on MNIST 


We evaluate the effectiveness of the proposed algorithms by training a neural network on the MNIST dataset for image classification. MNIST consists of 60,000 training images of 28 x 28 pixels, each containing a single numerical digit, and an additional 10,000 test examples. Our neural network consists of one 500-neuron fully-connected layer followed by a ten-unit softmax layer for classification, and the layers use ReLU activations [44]. The experiments are performed on an NVIDIA GeForce GTX1080 GPU, and the models are distributed over 6 computing servers, where each server has access to 10,000 training images. In the large-memory setting, each server uses its own 10,000 images to update the models, while in the small-memory setting, each server uses 1666 images randomly selected from its own 10,000 training images as a minibatch. We train the models for 200 epochs in the large-memory setting and 100 epochs in the small-memory setting. The training time is approximately 9 hours for each algorithm in the small-memory setting and 6 hours in the large-memory setting.


Figure 2: Comparison of the loss values of each algorithm among 5 different runs


We compare our algorithms with QSGD [13], DIANA [19], DoubleSqueeze [25], and TernGrad in the frequent-communication settings. For all the algorithms except DIANA, in every update we vectorize the gradient or the gradient difference of the weight matrices of the neural network, concatenate these vectors into one large vector, quantize it, and reshape the quantized vector into the original shapes of the weight matrices for updating. For DIANA, as suggested by [19], we quantize the weight of each layer separately, and we use two different block sizes in quantization (the full block size of the weights, or a block size equal to 128). As suggested by [19], the quantization level in the DIANA algorithms is equal to the block size $d_i$ for vector $i$, and the parameter $\alpha$ in DIANA is chosen to be $\min_i 1/\sqrt{d_i}$. In large-memory setting, we train our
algorithm DEED-GD with the parameters e = 0.1, s = 25, and in small-memory setting, we train DEED-SGD 
with parameters e = 0.2, s = 25. Here the maximal error € := Gt: We use the same encoding algorithm as 
described in linear regression. In both the large-memory and the small-memory setting, we present the results for 4-bit quantization for QSGD. For DoubleSqueeze, we perform the two kinds of quantization discussed in [25]: top-k compression and 1-bit compression. For a fair comparison, we use the same stepsize for all the algorithms: 0.25 in the large-memory setting and 1.18 in the small-memory setting. These stepsizes are chosen because the loss curves of the baseline methods (GD and SGD) are smooth and the baseline methods achieve fairly high testing accuracy. As shown in Figure 3, our algorithms as well as QSGD, DIANA and TernGrad achieve the same loss values as the baseline methods (GD and SGD) in both the large-memory setting and the small-memory setting.


To compare the efficiency of each algorithm, we compute the total number of bits used throughout the training. As large integers are less frequent in encoded vectors [13], we use Elias integer encoding to save bits in communication for all algorithms in the comparison. Notice that TernGrad, DIANA and QSGD only perform quantization on the computing nodes and will typically use 32-bit precision to encode the vectors which are sent from the center node in a star network. To be fair in the comparison, we use a previously proposed scaling technique and use $\log_2(1 + 2N) \cdot d$ bits to encode the vectors sent from the center node, where $N$ is the number of computing servers and $d$ is the dimension of the vector. This number is significantly smaller than $32 \cdot d$ unless $N > 2^{30}$. For DIANA and QSGD, we let the computing nodes share information with each other so that they do not need to broadcast via the center node, which would cost $32 \cdot d$ bits for each update. Under this setting, the number of bits communicated in each update is $B \cdot (N - 1)$, where $B$ is the number of bits communicated from one computing node and $N - 1$ is the number of other computing nodes it needs to communicate with. In most cases, the number of bits computed in this way is smaller than when using 32-bit precision to encode the vectors broadcast from the center node.


Figure 3: Comparison of the loss values of each algorithm


Results of Efficiency. To illustrate the efficiency of our proposed algorithms, we plot the number of bits against the testing accuracy in Figure 4(a) and Figure 4(b) for the frequent-communication large-memory setting and the frequent-communication small-memory setting. In both figures, the left-most curve corresponds to our proposed algorithms, DEED-GD and DEED-SGD. This means that our proposed algorithms use the fewest total number of bits to achieve a given accuracy.


To better illustrate how many bits we save compared with other algorithms, we present the total number of bits in Table 3 and Table 4, and report in the fourth column of each table the ratio between the number of bits required by each algorithm and the number of bits required by the proposed algorithm. For example, in Table 3 the number 10.44 means that DEED-GD uses 10.44 times fewer bits than QSGD to achieve the accuracy. In theory, DIANA and our proposed algorithm DEED-GD both have a linear convergence rate, but our experiments show that we use 190.28 times fewer bits than DIANA (block size equal to 128) to achieve similar performance in training. In addition, our algorithms achieve the highest testing accuracy among the quantized algorithms in the final epoch in the large-memory setting, as shown in the second column of the tables, and the accuracy is comparable to or even higher than that achieved by the non-quantized baseline algorithms (GD achieves 91.7% and SGD achieves 97.37%).
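The ratio column is simply the total number of bits of each algorithm divided by that of DEED-GD. As a small worked example, the sketch below recomputes the ratios from the rounded bit counts reported in Table 3; the dictionary name `table3_bits` is hypothetical, and small deviations from the printed ratios are expected because the bit counts are rounded to three significant digits.

```python
# Total number of bits from Table 3 (large-memory setting), as printed.
table3_bits = {
    "DEED-GD": 3.34e7,
    "QSGD": 3.49e8,
    "DIANA (full block)": 4.40e8,
    "DIANA (block size = 128)": 6.35e9,
    "DoubleSqueeze top-k": 6.21e8,
    "DoubleSqueeze 1-bit": 1.11e9,
    "TernGrad": 2.72e9,
}


def bit_ratios(bits, baseline="DEED-GD"):
    """Ratio of each algorithm's total bits to the baseline's total bits."""
    base = bits[baseline]
    return {name: value / base for name, value in bits.items()}


if __name__ == "__main__":
    for name, ratio in bit_ratios(table3_bits).items():
        print(f"{name:28s} {ratio:8.2f}")
```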


To ensure that the results are reproducible, we also train the models under different random seeds and with different parameters within certain ranges. For example, we train the models with our proposed algorithms with the parameter $e$ chosen from $[0.1, 0.3]$ and $s$ chosen from $\{16, 25, 32\}$. The comparison discussed above remains valid under these changes. We also run experiments with different quantization levels (e.g. 2-bit or 3-bit quantization) for QSGD and DIANA, but they cannot achieve the same testing accuracy within the same number of epochs as 4-bit quantization, so we do not discuss these results here.
Algorithm                   Testing accuracy   Total number of bits   Ratio
DEED-GD                     91.86%             3.34 x 10^7            1.00
QSGD                        91.85%             3.49 x 10^8            10.44
DIANA (full block)          91.82%             4.40 x 10^8            13.17
DIANA (block size = 128)    91.82%             6.35 x 10^9            190.28
DoubleSqueeze top-k         90.47%             6.21 x 10^8            18.59
DoubleSqueeze 1-bit         90.31%             1.11 x 10^9            33.34
TernGrad                    91.84%             2.72 x 10^9            81.44


Table 3: DEED-GD saves bits in communication for large-memory setting


Algorithm                   Testing accuracy   Total number of bits   Ratio
DEED-SGD                    97.33%             1.04 x 10^8            1.00
QSGD                        97.31%             6.52 x 10^8            6.25
DIANA (full block)          97.30%             6.83 x 10^8            6.55
DIANA (block size = 128)    97.27%             1.12 x 10^10           106.95
DoubleSqueeze top-k         96.67%             1.81 x 10^9            17.31
DoubleSqueeze 1-bit         95.96%             3.34 x 10^9            32.01
TernGrad                    97.44%             8.16 x 10^9            78.20


Table 4: DEED-SGD saves bits in communication for small-memory setting